A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual value and sell them on at a higher price. For this purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file below.
The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.
The company wants to know:
Which variables are significant in predicting the price of a house, and
How well those variables describe the price of a house.
Also, determine the optimal value of lambda for ridge and lasso regression.
You are required to model the price of houses with the available independent variables. Management will then use this model to understand how exactly the prices vary with the variables. They can accordingly adjust the strategy of the firm and concentrate on areas that will yield high returns. Further, the model will be a good way for management to understand the pricing dynamics of a new market.
1. Import libraries, Data
2. Perform EDA - Data cleanup, Preparation, Dummy variables etc
3. Reduce the features using RFE
4. Ridge Regression model
5. Lasso Regression model
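As a rough preview of steps 3 to 5, the sketch below runs RFE followed by a grid search over the regularisation strength for ridge and lasso. It uses synthetic data from `make_regression` rather than the housing data, and the choice of 15 features, the alpha grid and the 5-fold split are all assumptions made for illustration. In scikit-learn, the lambda of the problem statement is the `alpha` parameter.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the prepared housing features (assumption).
X, y = make_regression(n_samples=300, n_features=30, n_informative=8,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scale with statistics learned from the training split only.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 3: keep the 15 features ranked highest by RFE (15 is arbitrary).
rfe = RFE(LinearRegression(), n_features_to_select=15).fit(X_train, y_train)
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_test)

# Steps 4 and 5: cross-validated search over alpha (lambda).
alphas = {"alpha": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
results = {}
for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10000))]:
    search = GridSearchCV(model, alphas,
                          scoring="neg_mean_absolute_error", cv=5)
    search.fit(X_train_rfe, y_train)
    best_alpha = search.best_params_["alpha"]
    # R^2 of the refitted best model on the held-out split.
    test_r2 = search.best_estimator_.score(X_test_rfe, y_test)
    results[name] = (best_alpha, test_r2)
    print(name, "best alpha:", best_alpha, "test R^2:", round(test_r2, 3))
```

The best alpha depends on the data and on the grid, so the values printed here say nothing about the housing data itself.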
# Import libraries
import numpy as np
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model, metrics
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import os
#to ignore Warnings
import warnings
warnings.filterwarnings('ignore')
#Reading the data
Data_train = pd.read_csv("train.csv")
Data_train.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
Data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 # Column Non-Null Count Dtype
 --- ------ -------------- -----
 0 Id 1460 non-null int64
 1 MSSubClass 1460 non-null int64
 2 MSZoning 1460 non-null object
 3 LotFrontage 1201 non-null float64
 4 LotArea 1460 non-null int64
 5 Street 1460 non-null object
 6 Alley 91 non-null object
 7 LotShape 1460 non-null object
 8 LandContour 1460 non-null object
 9 Utilities 1460 non-null object
 10 LotConfig 1460 non-null object
 11 LandSlope 1460 non-null object
 12 Neighborhood 1460 non-null object
 13 Condition1 1460 non-null object
 14 Condition2 1460 non-null object
 15 BldgType 1460 non-null object
 16 HouseStyle 1460 non-null object
 17 OverallQual 1460 non-null int64
 18 OverallCond 1460 non-null int64
 19 YearBuilt 1460 non-null int64
 20 YearRemodAdd 1460 non-null int64
 21 RoofStyle 1460 non-null object
 22 RoofMatl 1460 non-null object
 23 Exterior1st 1460 non-null object
 24 Exterior2nd 1460 non-null object
 25 MasVnrType 1452 non-null object
 26 MasVnrArea 1452 non-null float64
 27 ExterQual 1460 non-null object
 28 ExterCond 1460 non-null object
 29 Foundation 1460 non-null object
 30 BsmtQual 1423 non-null object
 31 BsmtCond 1423 non-null object
 32 BsmtExposure 1422 non-null object
 33 BsmtFinType1 1423 non-null object
 34 BsmtFinSF1 1460 non-null int64
 35 BsmtFinType2 1422 non-null object
 36 BsmtFinSF2 1460 non-null int64
 37 BsmtUnfSF 1460 non-null int64
 38 TotalBsmtSF 1460 non-null int64
 39 Heating 1460 non-null object
 40 HeatingQC 1460 non-null object
 41 CentralAir 1460 non-null object
 42 Electrical 1459 non-null object
 43 1stFlrSF 1460 non-null int64
 44 2ndFlrSF 1460 non-null int64
 45 LowQualFinSF 1460 non-null int64
 46 GrLivArea 1460 non-null int64
 47 BsmtFullBath 1460 non-null int64
 48 BsmtHalfBath 1460 non-null int64
 49 FullBath 1460 non-null int64
 50 HalfBath 1460 non-null int64
 51 BedroomAbvGr 1460 non-null int64
 52 KitchenAbvGr 1460 non-null int64
 53 KitchenQual 1460 non-null object
 54 TotRmsAbvGrd 1460 non-null int64
 55 Functional 1460 non-null object
 56 Fireplaces 1460 non-null int64
 57 FireplaceQu 770 non-null object
 58 GarageType 1379 non-null object
 59 GarageYrBlt 1379 non-null float64
 60 GarageFinish 1379 non-null object
 61 GarageCars 1460 non-null int64
 62 GarageArea 1460 non-null int64
 63 GarageQual 1379 non-null object
 64 GarageCond 1379 non-null object
 65 PavedDrive 1460 non-null object
 66 WoodDeckSF 1460 non-null int64
 67 OpenPorchSF 1460 non-null int64
 68 EnclosedPorch 1460 non-null int64
 69 3SsnPorch 1460 non-null int64
 70 ScreenPorch 1460 non-null int64
 71 PoolArea 1460 non-null int64
 72 PoolQC 7 non-null object
 73 Fence 281 non-null object
 74 MiscFeature 54 non-null object
 75 MiscVal 1460 non-null int64
 76 MoSold 1460 non-null int64
 77 YrSold 1460 non-null int64
 78 SaleType 1460 non-null object
 79 SaleCondition 1460 non-null object
 80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
There are 1460 rows and 81 columns.
Unnecessary column: Id (a row identifier).
Target variable: y -> SalePrice.
Sale-related columns:
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
Columns with a high share of missing data:
Alley, FireplaceQu, PoolQC, Fence, MiscFeature.
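The list above was read off the `info()` output by eye. A minimal sketch of how the same check could be done programmatically is shown below, on a toy frame whose values are made up for illustration; the 0.4 threshold is an arbitrary assumption, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Data_train (illustrative values only).
df = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl", np.nan],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Fraction of missing values per column, largest first.
missing_frac = df.isna().mean().sort_values(ascending=False)

# Columns whose missing share exceeds the chosen threshold (assumed 0.4).
high_missing = missing_frac[missing_frac > 0.4].index.tolist()

print(missing_frac)
print(high_missing)
```

Applied to `Data_train`, the same two lines would rank all 81 columns by their missing share.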
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access to property
Grvl Gravel
Pave Paved
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
LandContour: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Utilities: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Neighborhood: Physical locations within Ames city limits
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
NAmes North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Condition1: Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Condition2: Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
BldgType: Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
HouseStyle: Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
OverallQual: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
OverallCond: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
RoofStyle: Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
RoofMatl: Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Exterior1st: Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
MasVnrArea: Masonry veneer area in square feet
ExterQual: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
ExterCond: Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Foundation: Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Concrete
Slab Slab
Stone Stone
Wood Wood
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
BsmtCond: Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
BsmtExposure: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
BsmtFinType1: Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
HeatingQC: Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
CentralAir: Central air conditioning
N No
Y Yes
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr: Kitchens above grade
KitchenQual: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
GarageType: Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
GarageCond: Garage condition
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
PavedDrive: Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold (MM)
YrSold: Year Sold (YYYY)
SaleType: Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
SaleCondition: Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)
Data_train.describe()
| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.000000 | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | ... | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
| mean | 730.500000 | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.195890 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.000000 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 34900.000000 |
| 25% | 365.750000 | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 | 129975.000000 |
| 50% | 730.500000 | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | ... | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 163000.000000 |
| 75% | 1095.250000 | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | ... | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 214000.000000 |
| max | 1460.000000 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | ... | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 | 755000.000000 |
8 rows × 38 columns
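# Correlate the numeric columns only; recent pandas versions raise an error on object columns
cor = Data_train.corr(numeric_only=True)
cor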
| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Id | 1.000000 | 0.011156 | -0.010601 | -0.033226 | -0.028365 | 0.012609 | -0.012713 | -0.021998 | -0.050298 | -0.005024 | ... | -0.029643 | -0.000477 | 0.002889 | -0.046635 | 0.001330 | 0.057044 | -0.006242 | 0.021172 | 0.000712 | -0.021917 |
| MSSubClass | 0.011156 | 1.000000 | -0.386347 | -0.139781 | 0.032628 | -0.059316 | 0.027850 | 0.040581 | 0.022936 | -0.069836 | ... | -0.012579 | -0.006100 | -0.012037 | -0.043825 | -0.026030 | 0.008283 | -0.007683 | -0.013585 | -0.021407 | -0.084284 |
| LotFrontage | -0.010601 | -0.386347 | 1.000000 | 0.426095 | 0.251646 | -0.059213 | 0.123349 | 0.088866 | 0.193458 | 0.233633 | ... | 0.088521 | 0.151972 | 0.010700 | 0.070029 | 0.041383 | 0.206167 | 0.003368 | 0.011200 | 0.007450 | 0.351799 |
| LotArea | -0.033226 | -0.139781 | 0.426095 | 1.000000 | 0.105806 | -0.005636 | 0.014228 | 0.013788 | 0.104160 | 0.214103 | ... | 0.171698 | 0.084774 | -0.018340 | 0.020423 | 0.043160 | 0.077672 | 0.038068 | 0.001205 | -0.014261 | 0.263843 |
| OverallQual | -0.028365 | 0.032628 | 0.251646 | 0.105806 | 1.000000 | -0.091932 | 0.572323 | 0.550684 | 0.411876 | 0.239666 | ... | 0.238923 | 0.308819 | -0.113937 | 0.030371 | 0.064886 | 0.065166 | -0.031406 | 0.070815 | -0.027347 | 0.790982 |
| OverallCond | 0.012609 | -0.059316 | -0.059213 | -0.005636 | -0.091932 | 1.000000 | -0.375983 | 0.073741 | -0.128101 | -0.046231 | ... | -0.003334 | -0.032589 | 0.070356 | 0.025504 | 0.054811 | -0.001985 | 0.068777 | -0.003511 | 0.043950 | -0.077856 |
| YearBuilt | -0.012713 | 0.027850 | 0.123349 | 0.014228 | 0.572323 | -0.375983 | 1.000000 | 0.592855 | 0.315707 | 0.249503 | ... | 0.224880 | 0.188686 | -0.387268 | 0.031355 | -0.050364 | 0.004950 | -0.034383 | 0.012398 | -0.013618 | 0.522897 |
| YearRemodAdd | -0.021998 | 0.040581 | 0.088866 | 0.013788 | 0.550684 | 0.073741 | 0.592855 | 1.000000 | 0.179618 | 0.128451 | ... | 0.205726 | 0.226298 | -0.193919 | 0.045286 | -0.038740 | 0.005829 | -0.010286 | 0.021490 | 0.035743 | 0.507101 |
| MasVnrArea | -0.050298 | 0.022936 | 0.193458 | 0.104160 | 0.411876 | -0.128101 | 0.315707 | 0.179618 | 1.000000 | 0.264736 | ... | 0.159718 | 0.125703 | -0.110204 | 0.018796 | 0.061466 | 0.011723 | -0.029815 | -0.005965 | -0.008201 | 0.477493 |
| BsmtFinSF1 | -0.005024 | -0.069836 | 0.233633 | 0.214103 | 0.239666 | -0.046231 | 0.249503 | 0.128451 | 0.264736 | 1.000000 | ... | 0.204306 | 0.111761 | -0.102303 | 0.026451 | 0.062021 | 0.140491 | 0.003571 | -0.015727 | 0.014359 | 0.386420 |
| BsmtFinSF2 | -0.005968 | -0.065649 | 0.049900 | 0.111170 | -0.059119 | 0.040229 | -0.049107 | -0.067759 | -0.072319 | -0.050117 | ... | 0.067898 | 0.003093 | 0.036543 | -0.029993 | 0.088871 | 0.041709 | 0.004940 | -0.015211 | 0.031706 | -0.011378 |
| BsmtUnfSF | -0.007940 | -0.140759 | 0.132644 | -0.002618 | 0.308159 | -0.136841 | 0.149040 | 0.181133 | 0.114442 | -0.495251 | ... | -0.005316 | 0.129005 | -0.002538 | 0.020764 | -0.012579 | -0.035092 | -0.023837 | 0.034888 | -0.041258 | 0.214479 |
| TotalBsmtSF | -0.015415 | -0.238518 | 0.392075 | 0.260833 | 0.537808 | -0.171098 | 0.391452 | 0.291066 | 0.363936 | 0.522396 | ... | 0.232019 | 0.247264 | -0.095478 | 0.037384 | 0.084489 | 0.126053 | -0.018479 | 0.013196 | -0.014969 | 0.613581 |
| 1stFlrSF | 0.010496 | -0.251758 | 0.457181 | 0.299475 | 0.476224 | -0.144203 | 0.281986 | 0.240379 | 0.344501 | 0.445863 | ... | 0.235459 | 0.211671 | -0.065292 | 0.056104 | 0.088758 | 0.131525 | -0.021096 | 0.031372 | -0.013604 | 0.605852 |
| 2ndFlrSF | 0.005590 | 0.307886 | 0.080177 | 0.050986 | 0.295493 | 0.028942 | 0.010308 | 0.140024 | 0.174561 | -0.137079 | ... | 0.092165 | 0.208026 | 0.061989 | -0.024358 | 0.040606 | 0.081487 | 0.016197 | 0.035164 | -0.028700 | 0.319334 |
| LowQualFinSF | -0.044230 | 0.046474 | 0.038469 | 0.004779 | -0.030429 | 0.025494 | -0.183784 | -0.062419 | -0.069071 | -0.064503 | ... | -0.025444 | 0.018251 | 0.061081 | -0.004296 | 0.026799 | 0.062157 | -0.003793 | -0.022174 | -0.028921 | -0.025606 |
| GrLivArea | 0.008273 | 0.074853 | 0.402797 | 0.263116 | 0.593007 | -0.079686 | 0.199010 | 0.287389 | 0.390857 | 0.208171 | ... | 0.247433 | 0.330224 | 0.009113 | 0.020643 | 0.101510 | 0.170205 | -0.002416 | 0.050240 | -0.036526 | 0.708624 |
| BsmtFullBath | 0.002289 | 0.003491 | 0.100949 | 0.158155 | 0.111098 | -0.054942 | 0.187599 | 0.119470 | 0.085310 | 0.649212 | ... | 0.175315 | 0.067341 | -0.049911 | -0.000106 | 0.023148 | 0.067616 | -0.023047 | -0.025361 | 0.067049 | 0.227122 |
| BsmtHalfBath | -0.020155 | -0.002333 | -0.007234 | 0.048046 | -0.040150 | 0.117821 | -0.038162 | -0.012337 | 0.026673 | 0.067418 | ... | 0.040161 | -0.025324 | -0.008555 | 0.035114 | 0.032121 | 0.020025 | -0.007367 | 0.032873 | -0.046524 | -0.016844 |
| FullBath | 0.005587 | 0.131608 | 0.198769 | 0.126031 | 0.550600 | -0.194149 | 0.468271 | 0.439046 | 0.276833 | 0.058543 | ... | 0.187703 | 0.259977 | -0.115093 | 0.035353 | -0.008106 | 0.049604 | -0.014290 | 0.055872 | -0.019669 | 0.560664 |
| HalfBath | 0.006784 | 0.177354 | 0.053532 | 0.014259 | 0.273458 | -0.060769 | 0.242656 | 0.183331 | 0.201444 | 0.004262 | ... | 0.108080 | 0.199740 | -0.095317 | -0.004972 | 0.072426 | 0.022381 | 0.001290 | -0.009050 | -0.010269 | 0.284108 |
| BedroomAbvGr | 0.037719 | -0.023438 | 0.263170 | 0.119690 | 0.101676 | 0.012980 | -0.070651 | -0.040581 | 0.102821 | -0.107355 | ... | 0.046854 | 0.093810 | 0.041570 | -0.024478 | 0.044300 | 0.070703 | 0.007767 | 0.046544 | -0.036014 | 0.168213 |
| KitchenAbvGr | 0.002951 | 0.281721 | -0.006069 | -0.017784 | -0.183882 | -0.087001 | -0.174800 | -0.149598 | -0.037610 | -0.081007 | ... | -0.090130 | -0.070091 | 0.037312 | -0.024600 | -0.051613 | -0.014525 | 0.062341 | 0.026589 | 0.031687 | -0.135907 |
| TotRmsAbvGrd | 0.027239 | 0.040380 | 0.352096 | 0.190015 | 0.427452 | -0.057583 | 0.095589 | 0.191740 | 0.280682 | 0.044316 | ... | 0.165984 | 0.234192 | 0.004151 | -0.006683 | 0.059383 | 0.083757 | 0.024763 | 0.036907 | -0.034516 | 0.533723 |
| Fireplaces | -0.019772 | -0.045569 | 0.266639 | 0.271364 | 0.396765 | -0.023820 | 0.147716 | 0.112581 | 0.249070 | 0.260011 | ... | 0.200019 | 0.169405 | -0.024822 | 0.011257 | 0.184530 | 0.095074 | 0.001409 | 0.046357 | -0.024096 | 0.466929 |
| GarageYrBlt | 0.000072 | 0.085072 | 0.070250 | -0.024947 | 0.547766 | -0.324297 | 0.825667 | 0.642277 | 0.252691 | 0.153484 | ... | 0.224577 | 0.228425 | -0.297003 | 0.023544 | -0.075418 | -0.014501 | -0.032417 | 0.005337 | -0.001014 | 0.486362 |
| GarageCars | 0.016570 | -0.040110 | 0.285691 | 0.154871 | 0.600671 | -0.185758 | 0.537850 | 0.420622 | 0.364204 | 0.224054 | ... | 0.226342 | 0.213569 | -0.151434 | 0.035765 | 0.050494 | 0.020934 | -0.043080 | 0.040522 | -0.039117 | 0.640409 |
| GarageArea | 0.017634 | -0.098672 | 0.344997 | 0.180403 | 0.562022 | -0.151521 | 0.478954 | 0.371600 | 0.373066 | 0.296970 | ... | 0.224666 | 0.241435 | -0.121777 | 0.035087 | 0.051412 | 0.061047 | -0.027400 | 0.027974 | -0.027378 | 0.623431 |
| WoodDeckSF | -0.029643 | -0.012579 | 0.088521 | 0.171698 | 0.238923 | -0.003334 | 0.224880 | 0.205726 | 0.159718 | 0.204306 | ... | 1.000000 | 0.058661 | -0.125989 | -0.032771 | -0.074181 | 0.073378 | -0.009551 | 0.021011 | 0.022270 | 0.324413 |
| OpenPorchSF | -0.000477 | -0.006100 | 0.151972 | 0.084774 | 0.308819 | -0.032589 | 0.188686 | 0.226298 | 0.125703 | 0.111761 | ... | 0.058661 | 1.000000 | -0.093079 | -0.005842 | 0.074304 | 0.060762 | -0.018584 | 0.071255 | -0.057619 | 0.315856 |
| EnclosedPorch | 0.002889 | -0.012037 | 0.010700 | -0.018340 | -0.113937 | 0.070356 | -0.387268 | -0.193919 | -0.110204 | -0.102303 | ... | -0.125989 | -0.093079 | 1.000000 | -0.037305 | -0.082864 | 0.054203 | 0.018361 | -0.028887 | -0.009916 | -0.128578 |
| 3SsnPorch | -0.046635 | -0.043825 | 0.070029 | 0.020423 | 0.030371 | 0.025504 | 0.031355 | 0.045286 | 0.018796 | 0.026451 | ... | -0.032771 | -0.005842 | -0.037305 | 1.000000 | -0.031436 | -0.007992 | 0.000354 | 0.029474 | 0.018645 | 0.044584 |
| ScreenPorch | 0.001330 | -0.026030 | 0.041383 | 0.043160 | 0.064886 | 0.054811 | -0.050364 | -0.038740 | 0.061466 | 0.062021 | ... | -0.074181 | 0.074304 | -0.082864 | -0.031436 | 1.000000 | 0.051307 | 0.031946 | 0.023217 | 0.010694 | 0.111447 |
| PoolArea | 0.057044 | 0.008283 | 0.206167 | 0.077672 | 0.065166 | -0.001985 | 0.004950 | 0.005829 | 0.011723 | 0.140491 | ... | 0.073378 | 0.060762 | 0.054203 | -0.007992 | 0.051307 | 1.000000 | 0.029669 | -0.033737 | -0.059689 | 0.092404 |
| MiscVal | -0.006242 | -0.007683 | 0.003368 | 0.038068 | -0.031406 | 0.068777 | -0.034383 | -0.010286 | -0.029815 | 0.003571 | ... | -0.009551 | -0.018584 | 0.018361 | 0.000354 | 0.031946 | 0.029669 | 1.000000 | -0.006495 | 0.004906 | -0.021190 |
| MoSold | 0.021172 | -0.013585 | 0.011200 | 0.001205 | 0.070815 | -0.003511 | 0.012398 | 0.021490 | -0.005965 | -0.015727 | ... | 0.021011 | 0.071255 | -0.028887 | 0.029474 | 0.023217 | -0.033737 | -0.006495 | 1.000000 | -0.145721 | 0.046432 |
| YrSold | 0.000712 | -0.021407 | 0.007450 | -0.014261 | -0.027347 | 0.043950 | -0.013618 | 0.035743 | -0.008201 | 0.014359 | ... | 0.022270 | -0.057619 | -0.009916 | 0.018645 | 0.010694 | -0.059689 | 0.004906 | -0.145721 | 1.000000 | -0.028923 |
| SalePrice | -0.021917 | -0.084284 | 0.351799 | 0.263843 | 0.790982 | -0.077856 | 0.522897 | 0.507101 | 0.477493 | 0.386420 | ... | 0.324413 | 0.315856 | -0.128578 | 0.044584 | 0.111447 | 0.092404 | -0.021190 | 0.046432 | -0.028923 | 1.000000 |
38 rows × 38 columns
#plotting correlation heatmap
plt.figure(figsize=(50,20))
sns.heatmap(cor, cmap="YlGnBu", annot = True)
plt.show()
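An annotated heatmap of this size can be hard to read, so it may help to list only the strongly correlated feature pairs. The sketch below does this on a toy frame built from random numbers; the column names are borrowed from the data set, but the values and the 0.8 cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: GarageYrBlt is built to track YearBuilt closely, MoSold is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "YearBuilt": base,
    "GarageYrBlt": base + rng.normal(scale=0.1, size=200),
    "MoSold": rng.normal(size=200),
})

corr = toy.corr()

# Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per (feature, feature) pair.
pairs = upper.stack()

# NaN entries, if any survive stacking, compare as False and are filtered out here.
strong = pairs[pairs.abs() > 0.8]
print(strong)
```

Pairs that show up here are candidates for multicollinearity, which is one reason to use regularised models later on.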
Target variable: y = SalePrice
msno.bar(Data_train)
[Figure: msno bar chart of non-null counts per column]
Columns with mostly missing data: Alley, PoolQC, Fence, MiscFeature.
Id is only a row identifier, so it can be removed as well.
Drop the columns listed above.
FireplaceQu has about 47% of its values missing (770 of 1460 are non-null); check and treat it separately.
# Drop columns with high missing values
col_to_drop = ['Id','Alley','PoolQC','Fence','MiscFeature']
Data_train = Data_train.drop(col_to_drop, axis = 1)
Data_train.shape
(1460, 76)
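FireplaceQu was flagged above for separate treatment. According to the data description, NA in that column means "No Fireplace", so one possible treatment is to check that reading against the Fireplaces count and then replace the missing values with an explicit label. The sketch below uses a toy frame, and the label `NoFireplace` is a hypothetical choice rather than something defined in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for Data_train (illustrative values only).
toy = pd.DataFrame({
    "Fireplaces": [0, 1, 2, 0],
    "FireplaceQu": [np.nan, "TA", "Gd", np.nan],
})

# Missing quality should coincide exactly with houses that have no fireplace.
consistent = (toy["FireplaceQu"].isna() == (toy["Fireplaces"] == 0)).all()
print("NaN matches zero fireplaces:", consistent)

# Replace NaN with an explicit category (hypothetical label).
toy["FireplaceQu"] = toy["FireplaceQu"].fillna("NoFireplace")
print(toy)
```

If the consistency check fails on the real data, the missing values would need a closer look before filling.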
col_zero_count = Data_train.eq(0).sum(axis=0)
col_zero_count[col_zero_count > 1000]
BsmtFinSF2 1293
LowQualFinSF 1434
BsmtHalfBath 1378
EnclosedPorch 1252
3SsnPorch 1436
ScreenPorch 1344
PoolArea 1453
MiscVal 1408
dtype: int64
Data_train.hist(bins=50,figsize=(20,15))
[Figure: grid of histograms, one per numeric column (37 panels, MSSubClass through SalePrice)]
Some columns contain only or mostly zeros and will not help further analysis, so they can be removed.
# Columns where more than 80% of the 1460 rows (1168) are zero
Col_to_del = col_zero_count[col_zero_count > 1168]
# Col_to_del is a Series of counts; its index holds the column names
#Data_train[Col_to_del.index].hist(bins=50,figsize=(20,15))
Col_to_drop_list = ['BsmtFinSF2','LowQualFinSF','BsmtHalfBath','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal']
Data_train[Col_to_drop_list].hist(bins=50,figsize=(20,15))
[Figure: histograms (50 bins each) of BsmtFinSF2, LowQualFinSF, BsmtHalfBath, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea and MiscVal]
#Drop columns where most of the values are zero
Data_train = Data_train.drop(Col_to_drop_list, axis = 1)
Data_train.shape
(1460, 68)
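The hard-coded drop list above can also be derived from the share of zeros per column. The following is a minimal sketch on toy data, assuming a hypothetical helper `mostly_zero_columns` and a share-based threshold (0.8 corresponds to the 1168-row cut-off used above); it is an illustration, not the notebook's own code.

```python
import pandas as pd

def mostly_zero_columns(df, threshold=0.8):
    """Return numeric columns whose share of zero values exceeds `threshold`.

    Hypothetical helper for illustration only.
    """
    numeric = df.select_dtypes(include="number")
    zero_share = (numeric == 0).mean()  # fraction of zeros per column
    return zero_share[zero_share > threshold].index.tolist()

# Toy frame standing in for Data_train (values are illustrative only)
toy = pd.DataFrame({
    "PoolArea": [0, 0, 0, 0, 0, 512],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "MiscVal": [0, 0, 0, 0, 0, 0],
})

zero_cols = mostly_zero_columns(toy, threshold=0.8)
print(zero_cols)
```

Deriving the list this way keeps the drop step in sync with the threshold if the data or the cut-off changes.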
# Correlate numeric columns only; object columns are excluded explicitly
cor = Data_train.corr(numeric_only=True)
#plotting correlation heatmap
plt.figure(figsize=(50,20))
sns.heatmap(cor, cmap="YlGnBu", annot = True)
plt.show()
cor["SalePrice"].sort_values(ascending=False)
SalePrice 1.000000 OverallQual 0.790982 GrLivArea 0.708624 GarageCars 0.640409 GarageArea 0.623431 TotalBsmtSF 0.613581 1stFlrSF 0.605852 FullBath 0.560664 TotRmsAbvGrd 0.533723 YearBuilt 0.522897 YearRemodAdd 0.507101 GarageYrBlt 0.486362 MasVnrArea 0.477493 Fireplaces 0.466929 BsmtFinSF1 0.386420 LotFrontage 0.351799 WoodDeckSF 0.324413 2ndFlrSF 0.319334 OpenPorchSF 0.315856 HalfBath 0.284108 LotArea 0.263843 BsmtFullBath 0.227122 BsmtUnfSF 0.214479 BedroomAbvGr 0.168213 MoSold 0.046432 YrSold -0.028923 OverallCond -0.077856 MSSubClass -0.084284 KitchenAbvGr -0.135907 Name: SalePrice, dtype: float64
The highest correlation with SalePrice is OverallQual (0.79), which is an ordinal rating rather than a continuous measurement; the next highest is GrLivArea (0.71). Both are plotted against SalePrice to check the relationship.
# Scatter plot of SalePrice (x-axis) against OverallQual (y-axis, ordinal rating)
plt.figure(figsize = [6,6])
plt.scatter(Data_train.SalePrice,Data_train.OverallQual)
plt.show()
# Scatter plot of SalePrice (x-axis) against GrLivArea (y-axis)
plt.figure(figsize = [6,6])
plt.scatter(Data_train.SalePrice,Data_train.GrLivArea)
plt.show()
Apart from some outliers, a roughly linear relationship is observed.
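Because OverallQual is an ordinal rating, a rank-based correlation can complement the Pearson value reported above. The following is a minimal sketch on made-up numbers (not the housing data), using `DataFrame.corr` with `method="spearman"`; the toy values are assumptions chosen only to show a monotonic but non-linear relationship.

```python
import pandas as pd

# Illustrative values only: price rises with quality, but not linearly
toy = pd.DataFrame({
    "OverallQual": [3, 5, 6, 7, 8, 10],
    "SalePrice": [80000, 120000, 150000, 200000, 280000, 500000],
})

# Pearson measures linear association; Spearman measures monotonic association
pearson = toy.corr(method="pearson").loc["OverallQual", "SalePrice"]
spearman = toy.corr(method="spearman").loc["OverallQual", "SalePrice"]

print(round(pearson, 3), round(spearman, 3))
```

On the real data the same two calls on `Data_train[["OverallQual", "SalePrice"]]` would show whether the rank-based view tells a different story.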
Data_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 68 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MSSubClass 1460 non-null int64 1 MSZoning 1460 non-null object 2 LotFrontage 1201 non-null float64 3 LotArea 1460 non-null int64 4 Street 1460 non-null object 5 LotShape 1460 non-null object 6 LandContour 1460 non-null object 7 Utilities 1460 non-null object 8 LotConfig 1460 non-null object 9 LandSlope 1460 non-null object 10 Neighborhood 1460 non-null object 11 Condition1 1460 non-null object 12 Condition2 1460 non-null object 13 BldgType 1460 non-null object 14 HouseStyle 1460 non-null object 15 OverallQual 1460 non-null int64 16 OverallCond 1460 non-null int64 17 YearBuilt 1460 non-null int64 18 YearRemodAdd 1460 non-null int64 19 RoofStyle 1460 non-null object 20 RoofMatl 1460 non-null object 21 Exterior1st 1460 non-null object 22 Exterior2nd 1460 non-null object 23 MasVnrType 1452 non-null object 24 MasVnrArea 1452 non-null float64 25 ExterQual 1460 non-null object 26 ExterCond 1460 non-null object 27 Foundation 1460 non-null object 28 BsmtQual 1423 non-null object 29 BsmtCond 1423 non-null object 30 BsmtExposure 1422 non-null object 31 BsmtFinType1 1423 non-null object 32 BsmtFinSF1 1460 non-null int64 33 BsmtFinType2 1422 non-null object 34 BsmtUnfSF 1460 non-null int64 35 TotalBsmtSF 1460 non-null int64 36 Heating 1460 non-null object 37 HeatingQC 1460 non-null object 38 CentralAir 1460 non-null object 39 Electrical 1459 non-null object 40 1stFlrSF 1460 non-null int64 41 2ndFlrSF 1460 non-null int64 42 GrLivArea 1460 non-null int64 43 BsmtFullBath 1460 non-null int64 44 FullBath 1460 non-null int64 45 HalfBath 1460 non-null int64 46 BedroomAbvGr 1460 non-null int64 47 KitchenAbvGr 1460 non-null int64 48 KitchenQual 1460 non-null object 49 TotRmsAbvGrd 1460 non-null int64 50 Functional 1460 non-null object 51 Fireplaces 1460 non-null int64 52 FireplaceQu 770 non-null object 53 GarageType 
1379 non-null object 54 GarageYrBlt 1379 non-null float64 55 GarageFinish 1379 non-null object 56 GarageCars 1460 non-null int64 57 GarageArea 1460 non-null int64 58 GarageQual 1379 non-null object 59 GarageCond 1379 non-null object 60 PavedDrive 1460 non-null object 61 WoodDeckSF 1460 non-null int64 62 OpenPorchSF 1460 non-null int64 63 MoSold 1460 non-null int64 64 YrSold 1460 non-null int64 65 SaleType 1460 non-null object 66 SaleCondition 1460 non-null object 67 SalePrice 1460 non-null int64 dtypes: float64(3), int64(26), object(39) memory usage: 775.8+ KB
### Handling Year features
#YearBuilt 1460 non-null int64
#YearRemodAdd 1460 non-null int64
#YrSold 1460 non-null int64
#GarageYrBlt 1379 non-null float64
Data_train['age'] = Data_train['YrSold']-Data_train['YearBuilt']
Data_train['age'].describe()
count 1460.000000 mean 36.547945 std 30.250152 min 0.000000 25% 8.000000 50% 35.000000 75% 54.000000 max 136.000000 Name: age, dtype: float64
Data_train['age_Remod'] = Data_train['YrSold']-Data_train['YearRemodAdd']
Data_train['age_Remod'].describe()
count 1460.000000 mean 22.950000 std 20.640653 min -1.000000 25% 4.000000 50% 14.000000 75% 41.000000 max 60.000000 Name: age_Remod, dtype: float64
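The minimum of -1 in the summary above means at least one house has a remodel year later than its sale year. A minimal sketch of one way to handle this, assuming such rows should be treated as remodelled at the time of sale (an assumption, not something the data dictionary confirms), is to count them and clip at zero:

```python
import pandas as pd

# Toy rows standing in for Data_train; the first has YearRemodAdd after YrSold
toy = pd.DataFrame({
    "YrSold": [2007, 2008, 2010],
    "YearRemodAdd": [2008, 2003, 1970],
})

age_remod = toy["YrSold"] - toy["YearRemodAdd"]
n_negative = int((age_remod < 0).sum())      # how many rows are affected

# Assumption: a negative age is treated as zero rather than dropped
toy["age_Remod"] = age_remod.clip(lower=0)

print(n_negative, toy["age_Remod"].tolist())
```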
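# Houses without a garage have no GarageYrBlt; fill with 0.0 (assignment avoids chained inplace modification)
Data_train['GarageYrBlt'] = Data_train['GarageYrBlt'].fillna(0.0)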
Data_train['age_Garage'] = Data_train['YrSold']-Data_train['GarageYrBlt']
Data_train['age_Garage'].describe()
count 1460.000000 mean 139.076027 std 453.714026 min 0.000000 25% 7.000000 50% 30.000000 75% 50.000000 max 2010.000000 Name: age_Garage, dtype: float64
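Filling GarageYrBlt with 0 makes age_Garage equal to the sale year for houses without a garage, which explains the maximum of 2010 and the mean of about 139 in the summary above. A minimal sketch of one alternative, assuming a hypothetical indicator column `has_garage` and a garage age of 0 for houses without one, is shown below on toy data.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for Data_train; the second house has no garage
toy = pd.DataFrame({
    "YrSold": [2008, 2007, 2010],
    "GarageYrBlt": [2003.0, np.nan, 1960.0],
})

# Hypothetical indicator so the model can still tell "no garage" from "new garage"
toy["has_garage"] = toy["GarageYrBlt"].notna().astype(int)

# Compute the age first, then fill the missing ages (not the missing years)
toy["age_Garage"] = (toy["YrSold"] - toy["GarageYrBlt"]).fillna(0.0)

print(toy[["has_garage", "age_Garage"]])
```

This keeps age_Garage on the same scale as the other age features, which matters once the columns are scaled and regularised.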
#delete the year columns
Col_to_drop_list = ['YearBuilt','YearRemodAdd','YrSold','GarageYrBlt']
Data_train = Data_train.drop(Col_to_drop_list, axis = 1)
Data_train.shape
(1460, 67)
# Correlate numeric columns only; object columns are excluded explicitly
cor = Data_train.corr(numeric_only=True)
#plotting correlation heatmap
plt.figure(figsize=(50,20))
sns.heatmap(cor, cmap="YlGnBu", annot = True)
plt.show()
Data_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 67 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MSSubClass 1460 non-null int64 1 MSZoning 1460 non-null object 2 LotFrontage 1201 non-null float64 3 LotArea 1460 non-null int64 4 Street 1460 non-null object 5 LotShape 1460 non-null object 6 LandContour 1460 non-null object 7 Utilities 1460 non-null object 8 LotConfig 1460 non-null object 9 LandSlope 1460 non-null object 10 Neighborhood 1460 non-null object 11 Condition1 1460 non-null object 12 Condition2 1460 non-null object 13 BldgType 1460 non-null object 14 HouseStyle 1460 non-null object 15 OverallQual 1460 non-null int64 16 OverallCond 1460 non-null int64 17 RoofStyle 1460 non-null object 18 RoofMatl 1460 non-null object 19 Exterior1st 1460 non-null object 20 Exterior2nd 1460 non-null object 21 MasVnrType 1452 non-null object 22 MasVnrArea 1452 non-null float64 23 ExterQual 1460 non-null object 24 ExterCond 1460 non-null object 25 Foundation 1460 non-null object 26 BsmtQual 1423 non-null object 27 BsmtCond 1423 non-null object 28 BsmtExposure 1422 non-null object 29 BsmtFinType1 1423 non-null object 30 BsmtFinSF1 1460 non-null int64 31 BsmtFinType2 1422 non-null object 32 BsmtUnfSF 1460 non-null int64 33 TotalBsmtSF 1460 non-null int64 34 Heating 1460 non-null object 35 HeatingQC 1460 non-null object 36 CentralAir 1460 non-null object 37 Electrical 1459 non-null object 38 1stFlrSF 1460 non-null int64 39 2ndFlrSF 1460 non-null int64 40 GrLivArea 1460 non-null int64 41 BsmtFullBath 1460 non-null int64 42 FullBath 1460 non-null int64 43 HalfBath 1460 non-null int64 44 BedroomAbvGr 1460 non-null int64 45 KitchenAbvGr 1460 non-null int64 46 KitchenQual 1460 non-null object 47 TotRmsAbvGrd 1460 non-null int64 48 Functional 1460 non-null object 49 Fireplaces 1460 non-null int64 50 FireplaceQu 770 non-null object 51 GarageType 1379 non-null object 52 GarageFinish 1379 non-null object 53 GarageCars 
1460 non-null int64 54 GarageArea 1460 non-null int64 55 GarageQual 1379 non-null object 56 GarageCond 1379 non-null object 57 PavedDrive 1460 non-null object 58 WoodDeckSF 1460 non-null int64 59 OpenPorchSF 1460 non-null int64 60 MoSold 1460 non-null int64 61 SaleType 1460 non-null object 62 SaleCondition 1460 non-null object 63 SalePrice 1460 non-null int64 64 age 1460 non-null int64 65 age_Remod 1460 non-null int64 66 age_Garage 1460 non-null float64 dtypes: float64(3), int64(25), object(39) memory usage: 764.3+ KB
# Fill the remaining missing values in numeric columns with the column median
Data_train.fillna(Data_train.median(numeric_only=True), inplace=True)
Data_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 67 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MSSubClass 1460 non-null int64 1 MSZoning 1460 non-null object 2 LotFrontage 1460 non-null float64 3 LotArea 1460 non-null int64 4 Street 1460 non-null object 5 LotShape 1460 non-null object 6 LandContour 1460 non-null object 7 Utilities 1460 non-null object 8 LotConfig 1460 non-null object 9 LandSlope 1460 non-null object 10 Neighborhood 1460 non-null object 11 Condition1 1460 non-null object 12 Condition2 1460 non-null object 13 BldgType 1460 non-null object 14 HouseStyle 1460 non-null object 15 OverallQual 1460 non-null int64 16 OverallCond 1460 non-null int64 17 RoofStyle 1460 non-null object 18 RoofMatl 1460 non-null object 19 Exterior1st 1460 non-null object 20 Exterior2nd 1460 non-null object 21 MasVnrType 1452 non-null object 22 MasVnrArea 1460 non-null float64 23 ExterQual 1460 non-null object 24 ExterCond 1460 non-null object 25 Foundation 1460 non-null object 26 BsmtQual 1423 non-null object 27 BsmtCond 1423 non-null object 28 BsmtExposure 1422 non-null object 29 BsmtFinType1 1423 non-null object 30 BsmtFinSF1 1460 non-null int64 31 BsmtFinType2 1422 non-null object 32 BsmtUnfSF 1460 non-null int64 33 TotalBsmtSF 1460 non-null int64 34 Heating 1460 non-null object 35 HeatingQC 1460 non-null object 36 CentralAir 1460 non-null object 37 Electrical 1459 non-null object 38 1stFlrSF 1460 non-null int64 39 2ndFlrSF 1460 non-null int64 40 GrLivArea 1460 non-null int64 41 BsmtFullBath 1460 non-null int64 42 FullBath 1460 non-null int64 43 HalfBath 1460 non-null int64 44 BedroomAbvGr 1460 non-null int64 45 KitchenAbvGr 1460 non-null int64 46 KitchenQual 1460 non-null object 47 TotRmsAbvGrd 1460 non-null int64 48 Functional 1460 non-null object 49 Fireplaces 1460 non-null int64 50 FireplaceQu 770 non-null object 51 GarageType 1379 non-null object 52 GarageFinish 1379 non-null object 53 GarageCars 
1460 non-null int64 54 GarageArea 1460 non-null int64 55 GarageQual 1379 non-null object 56 GarageCond 1379 non-null object 57 PavedDrive 1460 non-null object 58 WoodDeckSF 1460 non-null int64 59 OpenPorchSF 1460 non-null int64 60 MoSold 1460 non-null int64 61 SaleType 1460 non-null object 62 SaleCondition 1460 non-null object 63 SalePrice 1460 non-null int64 64 age 1460 non-null int64 65 age_Remod 1460 non-null int64 66 age_Garage 1460 non-null float64 dtypes: float64(3), int64(25), object(39) memory usage: 764.3+ KB
# Categorical variables
# Fill the remaining missing values with 'NO': the data dictionary uses NA (not available) when the feature does not exist (e.g. NA = No Basement)
Data_train.fillna('NO',inplace=True)
Data_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 67 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MSSubClass 1460 non-null int64 1 MSZoning 1460 non-null object 2 LotFrontage 1460 non-null float64 3 LotArea 1460 non-null int64 4 Street 1460 non-null object 5 LotShape 1460 non-null object 6 LandContour 1460 non-null object 7 Utilities 1460 non-null object 8 LotConfig 1460 non-null object 9 LandSlope 1460 non-null object 10 Neighborhood 1460 non-null object 11 Condition1 1460 non-null object 12 Condition2 1460 non-null object 13 BldgType 1460 non-null object 14 HouseStyle 1460 non-null object 15 OverallQual 1460 non-null int64 16 OverallCond 1460 non-null int64 17 RoofStyle 1460 non-null object 18 RoofMatl 1460 non-null object 19 Exterior1st 1460 non-null object 20 Exterior2nd 1460 non-null object 21 MasVnrType 1460 non-null object 22 MasVnrArea 1460 non-null float64 23 ExterQual 1460 non-null object 24 ExterCond 1460 non-null object 25 Foundation 1460 non-null object 26 BsmtQual 1460 non-null object 27 BsmtCond 1460 non-null object 28 BsmtExposure 1460 non-null object 29 BsmtFinType1 1460 non-null object 30 BsmtFinSF1 1460 non-null int64 31 BsmtFinType2 1460 non-null object 32 BsmtUnfSF 1460 non-null int64 33 TotalBsmtSF 1460 non-null int64 34 Heating 1460 non-null object 35 HeatingQC 1460 non-null object 36 CentralAir 1460 non-null object 37 Electrical 1460 non-null object 38 1stFlrSF 1460 non-null int64 39 2ndFlrSF 1460 non-null int64 40 GrLivArea 1460 non-null int64 41 BsmtFullBath 1460 non-null int64 42 FullBath 1460 non-null int64 43 HalfBath 1460 non-null int64 44 BedroomAbvGr 1460 non-null int64 45 KitchenAbvGr 1460 non-null int64 46 KitchenQual 1460 non-null object 47 TotRmsAbvGrd 1460 non-null int64 48 Functional 1460 non-null object 49 Fireplaces 1460 non-null int64 50 FireplaceQu 1460 non-null object 51 GarageType 1460 non-null object 52 GarageFinish 1460 non-null object 53 
GarageCars 1460 non-null int64 54 GarageArea 1460 non-null int64 55 GarageQual 1460 non-null object 56 GarageCond 1460 non-null object 57 PavedDrive 1460 non-null object 58 WoodDeckSF 1460 non-null int64 59 OpenPorchSF 1460 non-null int64 60 MoSold 1460 non-null int64 61 SaleType 1460 non-null object 62 SaleCondition 1460 non-null object 63 SalePrice 1460 non-null int64 64 age 1460 non-null int64 65 age_Remod 1460 non-null int64 66 age_Garage 1460 non-null float64 dtypes: float64(3), int64(25), object(39) memory usage: 764.3+ KB
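The two fill steps above rely on their order: the median fill handles numeric columns, and the later 'NO' fill catches whatever is left. A minimal sketch that makes the split by column type explicit, on toy data and with the hypothetical names `num_cols` and `cat_cols`, could look like this:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Data_train (values are illustrative only)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "BsmtQual": ["Gd", None, "TA"],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Numeric columns: fill with the column median
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
# Categorical columns: NA means the feature does not exist
toy[cat_cols] = toy[cat_cols].fillna("NO")

print(toy)
```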
#list of cat variables converted to dummy variables
cat_col_list = []
#list of features to drop because a single value accounts for more than 80% of rows
col_del_list = []
Data_train['MSSubClass'].astype('category').value_counts()
20 536 60 299 50 144 120 87 30 69 160 63 70 60 80 58 90 52 190 30 85 20 75 16 45 12 180 10 40 4 Name: MSSubClass, dtype: int64
MSSubClass = pd.get_dummies(Data_train.MSSubClass,drop_first=True)
MSSubClass = MSSubClass.add_prefix("MSSubClass_")
cat_col_list.append('MSSubClass')
MSSubClass
| | MSSubClass_30 | MSSubClass_40 | MSSubClass_45 | MSSubClass_50 | MSSubClass_60 | MSSubClass_70 | MSSubClass_75 | MSSubClass_80 | MSSubClass_85 | MSSubClass_90 | MSSubClass_120 | MSSubClass_160 | MSSubClass_180 | MSSubClass_190 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1456 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1457 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1458 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1459 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1460 rows × 14 columns
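The pattern of `get_dummies`, `add_prefix` and a later concat is repeated for every categorical column in this section. A minimal sketch of an equivalent single call, using the `columns` argument of `pd.get_dummies` on toy data, is shown below; the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for Data_train (values are illustrative only)
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "MSZoning": ["RL", "RL", "RM", "FV"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# One call encodes the listed columns, prefixes each dummy with its column
# name, drops the first level of each, and removes the original columns
encoded = pd.get_dummies(toy, columns=["MSSubClass", "MSZoning"], drop_first=True)

print(list(encoded.columns))
```

With this form there is no separate list of dummy frames to keep in sync with the list of columns to drop.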
Data_train['MSZoning'].astype('category').value_counts()
RL 1151 RM 218 FV 65 RH 16 C (all) 10 Name: MSZoning, dtype: int64
MSZoning = pd.get_dummies(Data_train.MSZoning,drop_first=True)
MSZoning = MSZoning.add_prefix("MSZoning_")
cat_col_list.append('MSZoning')
Data_train['Street'].astype('category').value_counts()
Pave 1454 Grvl 6 Name: Street, dtype: int64
col_del_list.append('Street')
Pave accounts for more than 99% of rows (1454 of 1460), well above the 80% cut-off, so this column can be dropped.
Data_train['LotShape'].astype('category').value_counts()
Reg 925 IR1 484 IR2 41 IR3 10 Name: LotShape, dtype: int64
LotShape = pd.get_dummies(Data_train.LotShape,drop_first=True)
LotShape = LotShape.add_prefix("LotShape_")
cat_col_list.append('LotShape')
Data_train['LandContour'].astype('category').value_counts()
Lvl 1311 Bnk 63 HLS 50 Low 36 Name: LandContour, dtype: int64
col_del_list.append('LandContour')
Data_train['Utilities'].astype('category').value_counts()
AllPub 1459 NoSeWa 1 Name: Utilities, dtype: int64
col_del_list.append('Utilities')
AllPub accounts for 1459 of 1460 rows, above the 80% cut-off, so this column can be dropped.
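The 80% rule applied by eye to each `value_counts()` output can be sketched as a small helper. The example below uses toy data and a hypothetical function `dominant_share`; it is an illustration of the rule, not the notebook's own code.

```python
import pandas as pd

def dominant_share(series):
    """Share of rows taken by the most frequent value (hypothetical helper)."""
    return series.value_counts(normalize=True).iloc[0]

# Toy frame standing in for Data_train (values are illustrative only)
toy = pd.DataFrame({
    "Street": ["Pave"] * 9 + ["Grvl"],
    "LotShape": ["Reg"] * 6 + ["IR1"] * 4,
})

shares = {col: dominant_share(toy[col]) for col in toy.columns}
# Columns where one value covers 80% or more of rows are candidates to drop
to_drop = [col for col, share in shares.items() if share >= 0.8]

print(shares, to_drop)
```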
Data_train['LotConfig'].astype('category').value_counts()
Inside 1052 Corner 263 CulDSac 94 FR2 47 FR3 4 Name: LotConfig, dtype: int64
LotConfig = pd.get_dummies(Data_train.LotConfig,drop_first=True)
LotConfig = LotConfig.add_prefix("LotConfig_")
cat_col_list.append('LotConfig')
Data_train['LandSlope'].astype('category').value_counts()
Gtl 1382 Mod 65 Sev 13 Name: LandSlope, dtype: int64
col_del_list.append('LandSlope')
Data_train['Neighborhood'].astype('category').value_counts()
NAmes 225 CollgCr 150 OldTown 113 Edwards 100 Somerst 86 Gilbert 79 NridgHt 77 Sawyer 74 NWAmes 73 SawyerW 59 BrkSide 58 Crawfor 51 Mitchel 49 NoRidge 41 Timber 38 IDOTRR 37 ClearCr 28 StoneBr 25 SWISU 25 Blmngtn 17 MeadowV 17 BrDale 16 Veenker 11 NPkVill 9 Blueste 2 Name: Neighborhood, dtype: int64
Neighborhood = pd.get_dummies(Data_train.Neighborhood,drop_first=True)
Neighborhood = Neighborhood.add_prefix("Neighborhood_")
cat_col_list.append('Neighborhood')
Data_train['Condition1'].astype('category').value_counts()
Norm 1260 Feedr 81 Artery 48 RRAn 26 PosN 19 RRAe 11 PosA 8 RRNn 5 RRNe 2 Name: Condition1, dtype: int64
col_del_list.append('Condition1')
Data_train['Condition2'].astype('category').value_counts()
Norm 1445 Feedr 6 Artery 2 PosN 2 RRNn 2 PosA 1 RRAe 1 RRAn 1 Name: Condition2, dtype: int64
col_del_list.append('Condition2')
Data_train['BldgType'].astype('category').value_counts()
1Fam 1220 TwnhsE 114 Duplex 52 Twnhs 43 2fmCon 31 Name: BldgType, dtype: int64
col_del_list.append('BldgType')
Data_train['HouseStyle'].astype('category').value_counts()
1Story 726 2Story 445 1.5Fin 154 SLvl 65 SFoyer 37 1.5Unf 14 2.5Unf 11 2.5Fin 8 Name: HouseStyle, dtype: int64
HouseStyle = pd.get_dummies(Data_train.HouseStyle,drop_first=True)
HouseStyle = HouseStyle.add_prefix("HouseStyle_")
cat_col_list.append('HouseStyle')
Data_train['RoofStyle'].astype('category').value_counts()
Gable 1141 Hip 286 Flat 13 Gambrel 11 Mansard 7 Shed 2 Name: RoofStyle, dtype: int64
RoofStyle = pd.get_dummies(Data_train.RoofStyle,drop_first=True)
RoofStyle = RoofStyle.add_prefix("RoofStyle_")
cat_col_list.append('RoofStyle')
Data_train['RoofMatl'].astype('category').value_counts()
CompShg 1434 Tar&Grv 11 WdShngl 6 WdShake 5 ClyTile 1 Membran 1 Metal 1 Roll 1 Name: RoofMatl, dtype: int64
# Add to del list
col_del_list.append('RoofMatl')
Data_train['Exterior1st'].astype('category').value_counts()
VinylSd 515 HdBoard 222 MetalSd 220 Wd Sdng 206 Plywood 108 CemntBd 61 BrkFace 50 WdShing 26 Stucco 25 AsbShng 20 BrkComm 2 Stone 2 AsphShn 1 CBlock 1 ImStucc 1 Name: Exterior1st, dtype: int64
Exterior1st = pd.get_dummies(Data_train.Exterior1st,drop_first=True)
Exterior1st = Exterior1st.add_prefix("Exterior1st_")
cat_col_list.append('Exterior1st')
Data_train['Exterior2nd'].astype('category').value_counts()
VinylSd 504 MetalSd 214 HdBoard 207 Wd Sdng 197 Plywood 142 CmentBd 60 Wd Shng 38 Stucco 26 BrkFace 25 AsbShng 20 ImStucc 10 Brk Cmn 7 Stone 5 AsphShn 3 CBlock 1 Other 1 Name: Exterior2nd, dtype: int64
Exterior2nd = pd.get_dummies(Data_train.Exterior2nd,drop_first=True)
Exterior2nd = Exterior2nd.add_prefix("Exterior2nd_")
cat_col_list.append('Exterior2nd')
Data_train['MasVnrType'].astype('category').value_counts()
None 864 BrkFace 445 Stone 128 BrkCmn 15 NO 8 Name: MasVnrType, dtype: int64
MasVnrType = pd.get_dummies(Data_train.MasVnrType,drop_first=True)
MasVnrType = MasVnrType.add_prefix("MasVnrType_")
cat_col_list.append('MasVnrType')
Data_train['ExterQual'].astype('category').value_counts()
TA 906 Gd 488 Ex 52 Fa 14 Name: ExterQual, dtype: int64
ExterQual = pd.get_dummies(Data_train.ExterQual,drop_first=True)
ExterQual = ExterQual.add_prefix("ExterQual_")
cat_col_list.append('ExterQual')
Data_train['ExterCond'].astype('category').value_counts()
TA 1282 Gd 146 Fa 28 Ex 3 Po 1 Name: ExterCond, dtype: int64
# Add to del list
col_del_list.append('ExterCond')
Data_train['Foundation'].astype('category').value_counts()
PConc 647 CBlock 634 BrkTil 146 Slab 24 Stone 6 Wood 3 Name: Foundation, dtype: int64
Foundation = pd.get_dummies(Data_train.Foundation,drop_first=True)
Foundation = Foundation.add_prefix("Foundation_")
cat_col_list.append('Foundation')
Data_train['BsmtQual'].astype('category').value_counts()
TA 649 Gd 618 Ex 121 NO 37 Fa 35 Name: BsmtQual, dtype: int64
BsmtQual = pd.get_dummies(Data_train.BsmtQual,drop_first=True)
BsmtQual = BsmtQual.add_prefix("BsmtQual_")
cat_col_list.append('BsmtQual')
Data_train['BsmtCond'].astype('category').value_counts()
TA 1311 Gd 65 Fa 45 NO 37 Po 2 Name: BsmtCond, dtype: int64
# Add to del list
col_del_list.append('BsmtCond')
Data_train['BsmtExposure'].astype('category').value_counts()
No 953 Av 221 Gd 134 Mn 114 NO 38 Name: BsmtExposure, dtype: int64
BsmtExposure = pd.get_dummies(Data_train.BsmtExposure,drop_first=True)
BsmtExposure = BsmtExposure.add_prefix("BsmtExposure_")
cat_col_list.append('BsmtExposure')
Data_train['BsmtFinType1'].astype('category').value_counts()
Unf 430 GLQ 418 ALQ 220 BLQ 148 Rec 133 LwQ 74 NO 37 Name: BsmtFinType1, dtype: int64
BsmtFinType1 = pd.get_dummies(Data_train.BsmtFinType1,drop_first=True)
BsmtFinType1 = BsmtFinType1.add_prefix("BsmtFinType1_")
cat_col_list.append('BsmtFinType1')
Data_train['BsmtFinType2'].astype('category').value_counts()
Unf 1256 Rec 54 LwQ 46 NO 38 BLQ 33 ALQ 19 GLQ 14 Name: BsmtFinType2, dtype: int64
# Add to del list
col_del_list.append('BsmtFinType2')
Data_train['Heating'].astype('category').value_counts()
GasA 1428 GasW 18 Grav 7 Wall 4 OthW 2 Floor 1 Name: Heating, dtype: int64
# Add to del list
col_del_list.append('Heating')
Data_train['HeatingQC'].astype('category').value_counts()
Ex 741 TA 428 Gd 241 Fa 49 Po 1 Name: HeatingQC, dtype: int64
HeatingQC = pd.get_dummies(Data_train.HeatingQC,drop_first=True)
HeatingQC = HeatingQC.add_prefix("HeatingQC_")
cat_col_list.append('HeatingQC')
Data_train['CentralAir'].astype('category').value_counts()
Y 1365 N 95 Name: CentralAir, dtype: int64
# Add to del list
col_del_list.append('CentralAir')
Data_train['Electrical'].astype('category').value_counts()
SBrkr 1334 FuseA 94 FuseF 27 FuseP 3 Mix 1 NO 1 Name: Electrical, dtype: int64
# Add to del list
col_del_list.append('Electrical')
Data_train['KitchenQual'].astype('category').value_counts()
TA 735 Gd 586 Ex 100 Fa 39 Name: KitchenQual, dtype: int64
KitchenQual = pd.get_dummies(Data_train.KitchenQual,drop_first=True)
KitchenQual = KitchenQual.add_prefix("KitchenQual_")
cat_col_list.append('KitchenQual')
Data_train['Functional'].astype('category').value_counts()
Typ 1360 Min2 34 Min1 31 Mod 15 Maj1 14 Maj2 5 Sev 1 Name: Functional, dtype: int64
# Add to del list
col_del_list.append('Functional')
Data_train['Fireplaces'].astype('category').value_counts()
0 690 1 650 2 115 3 5 Name: Fireplaces, dtype: int64
Data_train['FireplaceQu'].astype('category').value_counts()
NO 690 Gd 380 TA 313 Fa 33 Ex 24 Po 20 Name: FireplaceQu, dtype: int64
FireplaceQu = pd.get_dummies(Data_train.FireplaceQu,drop_first=True)
FireplaceQu = FireplaceQu.add_prefix("FireplaceQu_")
cat_col_list.append('FireplaceQu')
### Handling missing values of FireplaceQu
The 690 missing FireplaceQu values match the 690 houses with Fireplaces = 0, so NA here means no fireplace. These were already replaced with 'NO' by the fillna step above; the commented-out code below is kept for reference only.
#Data_train.FireplaceQu.fillna('NO',inplace=True)
#Data_train['FireplaceQu'].astype('category').value_counts()
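The claim that FireplaceQu is missing exactly where Fireplaces is 0 can be checked row by row rather than by comparing totals. A minimal sketch on toy data follows; on the real data it would need to run before the 'NO' fill, or compare against `== 'NO'` afterwards.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the raw data before any fill
toy = pd.DataFrame({
    "Fireplaces": [0, 1, 0, 2],
    "FireplaceQu": [np.nan, "Gd", np.nan, "TA"],
})

missing_quality = toy["FireplaceQu"].isna()
no_fireplace = toy["Fireplaces"] == 0

# Rows where the two conditions disagree would need a closer look
mismatches = toy[missing_quality != no_fireplace]

print(len(mismatches))
```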
Data_train['GarageType'].astype('category').value_counts()
Attchd 870 Detchd 387 BuiltIn 88 NO 81 Basment 19 CarPort 9 2Types 6 Name: GarageType, dtype: int64
GarageType = pd.get_dummies(Data_train.GarageType,drop_first=True)
GarageType = GarageType.add_prefix("GarageType_")
cat_col_list.append('GarageType')
Data_train['GarageFinish'].astype('category').value_counts()
Unf 605 RFn 422 Fin 352 NO 81 Name: GarageFinish, dtype: int64
GarageFinish = pd.get_dummies(Data_train.GarageFinish,drop_first=True)
GarageFinish = GarageFinish.add_prefix("GarageFinish_")
cat_col_list.append('GarageFinish')
Data_train['GarageQual'].astype('category').value_counts()
TA 1311 NO 81 Fa 48 Gd 14 Ex 3 Po 3 Name: GarageQual, dtype: int64
# Add to del list
col_del_list.append('GarageQual')
Data_train['GarageCond'].astype('category').value_counts()
TA 1326 NO 81 Fa 35 Gd 9 Po 7 Ex 2 Name: GarageCond, dtype: int64
# Add to del list
col_del_list.append('GarageCond')
Data_train['PavedDrive'].astype('category').value_counts()
Y 1340 N 90 P 30 Name: PavedDrive, dtype: int64
# Add to del list
col_del_list.append('PavedDrive')
Data_train['SaleType'].astype('category').value_counts()
WD 1267 New 122 COD 43 ConLD 9 ConLI 5 ConLw 5 CWD 4 Oth 3 Con 2 Name: SaleType, dtype: int64
# Add to del list
col_del_list.append('SaleType')
Data_train['SaleCondition'].astype('category').value_counts()
Normal 1198 Partial 125 Abnorml 101 Family 20 Alloca 12 AdjLand 4 Name: SaleCondition, dtype: int64
# Add to del list
col_del_list.append('SaleCondition')
Data_train['KitchenAbvGr'].astype('category').value_counts().plot.bar()
[Figure: bar chart of KitchenAbvGr value counts]
# Add to del list
col_del_list.append('KitchenAbvGr')
#numeric categorical variable: OverallQual
OverallQual = pd.get_dummies(Data_train.OverallQual,drop_first=True)
OverallQual = OverallQual.add_prefix("OverallQual_")
cat_col_list.append('OverallQual')
#numeric categorical variable: OverallCond
OverallCond = pd.get_dummies(Data_train.OverallCond,drop_first=True)
OverallCond = OverallCond.add_prefix("OverallCond_")
cat_col_list.append('OverallCond')
#numeric categorical variable: MoSold (month sold)
MoSold = pd.get_dummies(Data_train.MoSold,drop_first=True)
MoSold = MoSold.add_prefix("MoSold_")
cat_col_list.append('MoSold')
# Drop the features where a single value accounts for 80% or more of rows (1460 * 0.8 = 1168)
#col_list_del = ['Street','Utilities','LandContour','LandSlope','Condition1','Condition2','BldgType','RoofMatl','ExterCond','BsmtCond','BsmtFinType2','Heating','CentralAir','Electrical','Functional','GarageQual','GarageCond','PavedDrive','SaleType','SaleCondition','KitchenAbvGr']
Data_train = Data_train.drop(col_del_list, axis = 1)
Data_train.shape
(1460, 46)
Data_train.head()
| | MSSubClass | MSZoning | LotFrontage | LotArea | LotShape | LotConfig | Neighborhood | HouseStyle | OverallQual | OverallCond | ... | GarageFinish | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | MoSold | SalePrice | age | age_Remod | age_Garage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Reg | Inside | CollgCr | 2Story | 7 | 5 | ... | RFn | 2 | 548 | 0 | 61 | 2 | 208500 | 5 | 5 | 5.0 |
| 1 | 20 | RL | 80.0 | 9600 | Reg | FR2 | Veenker | 1Story | 6 | 8 | ... | RFn | 2 | 460 | 298 | 0 | 5 | 181500 | 31 | 31 | 31.0 |
| 2 | 60 | RL | 68.0 | 11250 | IR1 | Inside | CollgCr | 2Story | 7 | 5 | ... | RFn | 2 | 608 | 0 | 42 | 9 | 223500 | 7 | 6 | 7.0 |
| 3 | 70 | RL | 60.0 | 9550 | IR1 | Corner | Crawfor | 2Story | 7 | 5 | ... | Unf | 3 | 642 | 0 | 35 | 2 | 140000 | 91 | 36 | 8.0 |
| 4 | 60 | RL | 84.0 | 14260 | IR1 | FR2 | NoRidge | 2Story | 8 | 5 | ... | RFn | 3 | 836 | 192 | 84 | 12 | 250000 | 8 | 8 | 8.0 |
5 rows × 46 columns
After dropping the unwanted columns, 46 columns remain, including the target SalePrice.
Data_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 46 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MSSubClass 1460 non-null int64 1 MSZoning 1460 non-null object 2 LotFrontage 1460 non-null float64 3 LotArea 1460 non-null int64 4 LotShape 1460 non-null object 5 LotConfig 1460 non-null object 6 Neighborhood 1460 non-null object 7 HouseStyle 1460 non-null object 8 OverallQual 1460 non-null int64 9 OverallCond 1460 non-null int64 10 RoofStyle 1460 non-null object 11 Exterior1st 1460 non-null object 12 Exterior2nd 1460 non-null object 13 MasVnrType 1460 non-null object 14 MasVnrArea 1460 non-null float64 15 ExterQual 1460 non-null object 16 Foundation 1460 non-null object 17 BsmtQual 1460 non-null object 18 BsmtExposure 1460 non-null object 19 BsmtFinType1 1460 non-null object 20 BsmtFinSF1 1460 non-null int64 21 BsmtUnfSF 1460 non-null int64 22 TotalBsmtSF 1460 non-null int64 23 HeatingQC 1460 non-null object 24 1stFlrSF 1460 non-null int64 25 2ndFlrSF 1460 non-null int64 26 GrLivArea 1460 non-null int64 27 BsmtFullBath 1460 non-null int64 28 FullBath 1460 non-null int64 29 HalfBath 1460 non-null int64 30 BedroomAbvGr 1460 non-null int64 31 KitchenQual 1460 non-null object 32 TotRmsAbvGrd 1460 non-null int64 33 Fireplaces 1460 non-null int64 34 FireplaceQu 1460 non-null object 35 GarageType 1460 non-null object 36 GarageFinish 1460 non-null object 37 GarageCars 1460 non-null int64 38 GarageArea 1460 non-null int64 39 WoodDeckSF 1460 non-null int64 40 OpenPorchSF 1460 non-null int64 41 MoSold 1460 non-null int64 42 SalePrice 1460 non-null int64 43 age 1460 non-null int64 44 age_Remod 1460 non-null int64 45 age_Garage 1460 non-null float64 dtypes: float64(3), int64(24), object(19) memory usage: 524.8+ KB
#Display categorical variables list
cat_col_list
['MSSubClass', 'MSZoning', 'LotShape', 'LotConfig', 'Neighborhood', 'HouseStyle', 'RoofStyle', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'Foundation', 'BsmtQual', 'BsmtExposure', 'BsmtFinType1', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageType', 'GarageFinish', 'OverallQual', 'OverallCond', 'MoSold']
#Drop the categorical variables from the Data_train(maintable)
Data_train = Data_train.drop(cat_col_list,axis = 1)
#create table for all dummy variables (categorical variables)
cat_concat = pd.concat([MSSubClass, MSZoning, LotShape, LotConfig, Neighborhood, HouseStyle, RoofStyle, Exterior1st,
Exterior2nd, MasVnrType, ExterQual, Foundation, BsmtQual, BsmtExposure, BsmtFinType1, HeatingQC, KitchenQual,
FireplaceQu, GarageType, GarageFinish, OverallQual, OverallCond, MoSold], axis = 1)
#concat dummy variables table to the main table
Data_train = pd.concat([Data_train,cat_concat], axis = 1)
Data_train.shape
(1460, 188)
After dropping the unwanted columns and creating dummy variables, Data_train has 1460 rows and 188 columns.
cor = Data_train.corr()
#plotting correlation heatmap
#plt.figure(figsize=(50,20))
#sns.heatmap(cor, cmap="YlGnBu", annot = True)
#plt.show()
cor
| | LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | GrLivArea | BsmtFullBath | ... | MoSold_3 | MoSold_4 | MoSold_5 | MoSold_6 | MoSold_7 | MoSold_8 | MoSold_9 | MoSold_10 | MoSold_11 | MoSold_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LotFrontage | 1.000000 | 0.304522 | 0.178469 | 0.214367 | 0.124098 | 0.363472 | 0.413773 | 0.072388 | 0.368007 | 0.090343 | ... | 0.015397 | -0.068741 | -0.008092 | 0.003820 | 0.014998 | -0.036235 | 0.042033 | -0.027674 | 0.060188 | -0.010418 |
| LotArea | 0.304522 | 1.000000 | 0.103321 | 0.214103 | -0.002618 | 0.260833 | 0.299475 | 0.050986 | 0.263116 | 0.158155 | ... | 0.002208 | -0.032536 | -0.042141 | 0.064209 | -0.018250 | 0.019117 | 0.004245 | -0.032004 | 0.016914 | -0.001071 |
| MasVnrArea | 0.178469 | 0.103321 | 1.000000 | 0.261256 | 0.113862 | 0.360067 | 0.339850 | 0.173800 | 0.388052 | 0.083010 | ... | 0.033699 | -0.043721 | -0.053327 | 0.024093 | -0.002133 | 0.009243 | -0.002265 | 0.004953 | 0.013301 | -0.010876 |
| BsmtFinSF1 | 0.214367 | 0.214103 | 0.261256 | 1.000000 | -0.495251 | 0.522396 | 0.445863 | -0.137079 | 0.208171 | 0.649212 | ... | 0.008234 | 0.003818 | -0.020406 | -0.024166 | -0.035538 | -0.009820 | 0.006186 | 0.012646 | 0.045063 | -0.009999 |
| BsmtUnfSF | 0.124098 | -0.002618 | 0.113862 | -0.495251 | 1.000000 | 0.415360 | 0.317987 | 0.004469 | 0.240257 | -0.422900 | ... | 0.017249 | -0.023770 | -0.028910 | -0.021212 | 0.016779 | 0.016158 | 0.022900 | 0.016096 | -0.004482 | 0.026173 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| MoSold_8 | -0.036235 | 0.019117 | 0.009243 | -0.009820 | 0.016158 | 0.004755 | 0.010161 | 0.023561 | 0.026972 | 0.010059 | ... | -0.084488 | -0.098728 | -0.121695 | -0.138248 | -0.131921 | 1.000000 | -0.064124 | -0.076936 | -0.072222 | -0.061967 |
| MoSold_9 | 0.042033 | 0.004245 | -0.002265 | 0.006186 | 0.022900 | 0.026563 | 0.027410 | 0.030732 | 0.050365 | -0.031169 | ... | -0.059418 | -0.069432 | -0.085584 | -0.097225 | -0.092776 | -0.064124 | 1.000000 | -0.054106 | -0.050791 | -0.043579 |
| MoSold_10 | -0.027674 | -0.032004 | 0.004953 | 0.012646 | 0.016096 | 0.027637 | 0.038300 | -0.045281 | -0.009479 | -0.004721 | ... | -0.071289 | -0.083304 | -0.102683 | -0.116650 | -0.111311 | -0.076936 | -0.054106 | 1.000000 | -0.060939 | -0.052286 |
| MoSold_11 | 0.060188 | 0.016914 | 0.013301 | 0.045063 | -0.004482 | 0.042519 | 0.045191 | 0.018278 | 0.045770 | 0.043177 | ... | -0.066921 | -0.078199 | -0.096391 | -0.109502 | -0.104491 | -0.072222 | -0.050791 | -0.060939 | 1.000000 | -0.049082 |
| MoSold_12 | -0.010418 | -0.001071 | -0.010876 | -0.009999 | 0.026173 | 0.005557 | 0.007704 | -0.006716 | -0.000837 | -0.020754 | ... | -0.057418 | -0.067096 | -0.082704 | -0.093954 | -0.089654 | -0.061967 | -0.043579 | -0.052286 | -0.049082 | 1.000000 |
188 rows × 188 columns
cor["SalePrice"].sort_values(ascending=False)
SalePrice 1.000000
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
...
FireplaceQu_NO -0.471908
age_Remod -0.509079
KitchenQual_TA -0.519298
age -0.523350
ExterQual_TA -0.589044
Name: SalePrice, Length: 188, dtype: float64
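Sorting by the signed correlation puts strong negative predictors (such as `ExterQual_TA` and `age` above) at the bottom of the list. A small sketch on synthetic data shows one way to rank by absolute correlation instead; the column names and coefficients here are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.normal(size=n)
age = rng.normal(size=n)

# Synthetic target: rises with area, falls with age (hypothetical relationship).
toy = pd.DataFrame({
    "area": area,
    "age": age,
    "noise": rng.normal(size=n),
    "price": 3.0 * area - 2.0 * age + 0.1 * rng.normal(size=n),
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["price"].drop("price")

# Order by magnitude but keep the sign so the direction stays visible.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```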
#import train_test_split from sklearn for splitting the data
from sklearn.model_selection import train_test_split
np.random.seed(0)
#Data split into 70:30 ratio of train and test set
df_train, df_test = train_test_split(Data_train, train_size = 0.7, test_size = 0.3, random_state=100 )
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
num_val_col = ['LotFrontage','LotArea','MasVnrArea','BsmtFinSF1','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','GrLivArea',
'BsmtFullBath','FullBath','HalfBath','BedroomAbvGr','TotRmsAbvGrd','Fireplaces','GarageCars','GarageArea',
'WoodDeckSF','OpenPorchSF','age','age_Remod','age_Garage','SalePrice']
df_train[num_val_col] = scaler.fit_transform(df_train[num_val_col])
df_train.describe()
| | LotFrontage | LotArea | MasVnrArea | BsmtFinSF1 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | GrLivArea | BsmtFullBath | ... | MoSold_3 | MoSold_4 | MoSold_5 | MoSold_6 | MoSold_7 | MoSold_8 | MoSold_9 | MoSold_10 | MoSold_11 | MoSold_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | ... | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 | 1021.000000 |
| mean | 0.166249 | 0.042143 | 0.065306 | 0.079337 | 0.241388 | 0.173773 | 0.184341 | 0.165943 | 0.207345 | 0.146588 | ... | 0.070519 | 0.110676 | 0.142997 | 0.169442 | 0.147894 | 0.078355 | 0.041136 | 0.061704 | 0.057786 | 0.042116 |
| std | 0.075615 | 0.048226 | 0.117088 | 0.082377 | 0.192066 | 0.075145 | 0.092132 | 0.210799 | 0.102232 | 0.175127 | ... | 0.256145 | 0.313884 | 0.350241 | 0.375325 | 0.355169 | 0.268860 | 0.198702 | 0.240735 | 0.233454 | 0.200951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.133562 | 0.027923 | 0.000000 | 0.000000 | 0.092466 | 0.129787 | 0.116435 | 0.000000 | 0.133743 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.164384 | 0.037531 | 0.000000 | 0.069454 | 0.197774 | 0.162357 | 0.165278 | 0.000000 | 0.197540 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.195205 | 0.046943 | 0.098750 | 0.126152 | 0.345034 | 0.215057 | 0.243056 | 0.352058 | 0.255573 | 0.333333 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 188 columns
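The scaler is fitted on the training split only, which is why every training column spans exactly 0 to 1 in the summary above. The sketch below reproduces the min-max formula with plain NumPy on made-up numbers to show what happens when the training statistics are later applied to unseen rows.

```python
import numpy as np

# Made-up train and test rows with two numeric columns.
train = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 200.0]])
test = np.array([[2.0, 400.0], [0.0, 150.0]])

# Learn the column minimum and range from the training rows only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

# Apply the same transformation to both splits.
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

print(train_scaled)
print(test_scaled)  # values outside [0, 1] are possible for unseen rows
```

Test values can fall below 0 or above 1. That is expected, and it is preferable to refitting the scaler on the test data, which would leak information from the test set.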
#divide X and y variables
y_train = df_train.pop('SalePrice')
X_train = df_train
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
#Run RFE, keeping the 40 highest-ranked features
lm = LinearRegression()
lm.fit(X_train,y_train)
rfe = RFE(lm, n_features_to_select = 40)
rfe = rfe.fit(X_train,y_train)
list(zip(X_train.columns, rfe.support_,rfe.ranking_))
[('LotFrontage', True, 1),
('LotArea', True, 1),
('MasVnrArea', False, 70),
('BsmtFinSF1', False, 37),
('BsmtUnfSF', False, 130),
('TotalBsmtSF', False, 38),
('1stFlrSF', True, 1),
('2ndFlrSF', True, 1),
('GrLivArea', False, 77),
('BsmtFullBath', True, 1),
('FullBath', False, 15),
('HalfBath', False, 92),
('BedroomAbvGr', False, 114),
('TotRmsAbvGrd', False, 112),
('Fireplaces', False, 120),
('GarageCars', True, 1),
('GarageArea', True, 1),
('WoodDeckSF', False, 54),
('OpenPorchSF', False, 91),
('age', True, 1),
('age_Remod', False, 45),
('age_Garage', False, 121),
('MSSubClass_30', False, 107),
('MSSubClass_40', False, 97),
('MSSubClass_45', False, 102),
('MSSubClass_50', False, 80),
('MSSubClass_60', False, 147),
('MSSubClass_70', False, 76),
('MSSubClass_75', False, 20),
('MSSubClass_80', False, 144),
('MSSubClass_85', False, 59),
('MSSubClass_90', True, 1),
('MSSubClass_120', False, 2),
('MSSubClass_160', True, 1),
('MSSubClass_180', False, 19),
('MSSubClass_190', False, 43),
('MSZoning_FV', False, 29),
('MSZoning_RH', False, 26),
('MSZoning_RL', False, 27),
('MSZoning_RM', False, 28),
('LotShape_IR2', False, 119),
('LotShape_IR3', True, 1),
('LotShape_Reg', False, 122),
('LotConfig_CulDSac', False, 65),
('LotConfig_FR2', False, 23),
('LotConfig_FR3', False, 79),
('LotConfig_Inside', False, 142),
('Neighborhood_Blueste', False, 88),
('Neighborhood_BrDale', False, 46),
('Neighborhood_BrkSide', False, 100),
('Neighborhood_ClearCr', False, 47),
('Neighborhood_CollgCr', False, 62),
('Neighborhood_Crawfor', True, 1),
('Neighborhood_Edwards', False, 22),
('Neighborhood_Gilbert', False, 61),
('Neighborhood_IDOTRR', False, 118),
('Neighborhood_MeadowV', False, 104),
('Neighborhood_Mitchel', False, 35),
('Neighborhood_NAmes', False, 140),
('Neighborhood_NPkVill', False, 58),
('Neighborhood_NWAmes', False, 141),
('Neighborhood_NoRidge', True, 1),
('Neighborhood_NridgHt', True, 1),
('Neighborhood_OldTown', False, 31),
('Neighborhood_SWISU', False, 103),
('Neighborhood_Sawyer', False, 136),
('Neighborhood_SawyerW', False, 63),
('Neighborhood_Somerst', True, 1),
('Neighborhood_StoneBr', True, 1),
('Neighborhood_Timber', False, 106),
('Neighborhood_Veenker', False, 73),
('HouseStyle_1.5Unf', False, 48),
('HouseStyle_1Story', False, 25),
('HouseStyle_2.5Fin', False, 18),
('HouseStyle_2.5Unf', False, 16),
('HouseStyle_2Story', False, 30),
('HouseStyle_SFoyer', False, 60),
('HouseStyle_SLvl', False, 82),
('RoofStyle_Gable', False, 74),
('RoofStyle_Gambrel', False, 24),
('RoofStyle_Hip', False, 72),
('RoofStyle_Mansard', False, 39),
('RoofStyle_Shed', False, 69),
('Exterior1st_AsphShn', False, 14),
('Exterior1st_BrkComm', False, 89),
('Exterior1st_BrkFace', True, 1),
('Exterior1st_CBlock', False, 6),
('Exterior1st_CemntBd', False, 49),
('Exterior1st_HdBoard', False, 71),
('Exterior1st_ImStucc', True, 1),
('Exterior1st_MetalSd', False, 85),
('Exterior1st_Plywood', False, 134),
('Exterior1st_Stone', False, 84),
('Exterior1st_Stucco', False, 87),
('Exterior1st_VinylSd', False, 105),
('Exterior1st_Wd Sdng', False, 127),
('Exterior1st_WdShing', False, 131),
('Exterior2nd_AsphShn', False, 11),
('Exterior2nd_Brk Cmn', False, 148),
('Exterior2nd_BrkFace', False, 51),
('Exterior2nd_CBlock', False, 12),
('Exterior2nd_CmentBd', False, 50),
('Exterior2nd_HdBoard', False, 75),
('Exterior2nd_ImStucc', True, 1),
('Exterior2nd_MetalSd', False, 86),
('Exterior2nd_Other', False, 17),
('Exterior2nd_Plywood', False, 135),
('Exterior2nd_Stone', False, 123),
('Exterior2nd_Stucco', False, 21),
('Exterior2nd_VinylSd', False, 99),
('Exterior2nd_Wd Sdng', False, 139),
('Exterior2nd_Wd Shng', False, 42),
('MasVnrType_BrkFace', False, 66),
('MasVnrType_NO', False, 55),
('MasVnrType_None', False, 67),
('MasVnrType_Stone', False, 68),
('ExterQual_Fa', False, 116),
('ExterQual_Gd', False, 78),
('ExterQual_TA', False, 81),
('Foundation_CBlock', False, 101),
('Foundation_PConc', False, 111),
('Foundation_Slab', False, 145),
('Foundation_Stone', False, 41),
('Foundation_Wood', False, 57),
('BsmtQual_Fa', True, 1),
('BsmtQual_Gd', True, 1),
('BsmtQual_NO', True, 1),
('BsmtQual_TA', True, 1),
('BsmtExposure_Gd', True, 1),
('BsmtExposure_Mn', False, 108),
('BsmtExposure_NO', False, 52),
('BsmtExposure_No', False, 56),
('BsmtFinType1_BLQ', False, 126),
('BsmtFinType1_GLQ', False, 98),
('BsmtFinType1_LwQ', False, 64),
('BsmtFinType1_NO', True, 1),
('BsmtFinType1_Rec', False, 146),
('BsmtFinType1_Unf', False, 40),
('HeatingQC_Fa', False, 143),
('HeatingQC_Gd', False, 124),
('HeatingQC_Po', True, 1),
('HeatingQC_TA', False, 137),
('KitchenQual_Fa', True, 1),
('KitchenQual_Gd', True, 1),
('KitchenQual_TA', True, 1),
('FireplaceQu_Fa', False, 113),
('FireplaceQu_Gd', False, 110),
('FireplaceQu_NO', False, 83),
('FireplaceQu_Po', False, 94),
('FireplaceQu_TA', False, 109),
('GarageType_Attchd', False, 33),
('GarageType_Basment', False, 32),
('GarageType_BuiltIn', False, 36),
('GarageType_CarPort', False, 44),
('GarageType_Detchd', False, 34),
('GarageType_NO', False, 10),
('GarageFinish_NO', False, 4),
('GarageFinish_RFn', False, 96),
('GarageFinish_Unf', False, 90),
('OverallQual_2', False, 13),
('OverallQual_3', False, 8),
('OverallQual_4', False, 9),
('OverallQual_5', False, 7),
('OverallQual_6', False, 5),
('OverallQual_7', False, 3),
('OverallQual_8', True, 1),
('OverallQual_9', True, 1),
('OverallQual_10', True, 1),
('OverallCond_2', True, 1),
('OverallCond_3', True, 1),
('OverallCond_4', True, 1),
('OverallCond_5', True, 1),
('OverallCond_6', True, 1),
('OverallCond_7', True, 1),
('OverallCond_8', True, 1),
('OverallCond_9', True, 1),
('MoSold_2', False, 138),
('MoSold_3', False, 129),
('MoSold_4', False, 132),
('MoSold_5', False, 115),
('MoSold_6', False, 133),
('MoSold_7', False, 95),
('MoSold_8', False, 93),
('MoSold_9', False, 117),
('MoSold_10', False, 53),
('MoSold_11', False, 128),
('MoSold_12', False, 125)]
#selected list
col_list = X_train.columns[rfe.support_]
col_list
Index(['LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF', 'BsmtFullBath',
'GarageCars', 'GarageArea', 'age', 'MSSubClass_90', 'MSSubClass_160',
'LotShape_IR3', 'Neighborhood_Crawfor', 'Neighborhood_NoRidge',
'Neighborhood_NridgHt', 'Neighborhood_Somerst', 'Neighborhood_StoneBr',
'Exterior1st_BrkFace', 'Exterior1st_ImStucc', 'Exterior2nd_ImStucc',
'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_NO', 'BsmtQual_TA',
'BsmtExposure_Gd', 'BsmtFinType1_NO', 'HeatingQC_Po', 'KitchenQual_Fa',
'KitchenQual_Gd', 'KitchenQual_TA', 'OverallQual_8', 'OverallQual_9',
'OverallQual_10', 'OverallCond_2', 'OverallCond_3', 'OverallCond_4',
'OverallCond_5', 'OverallCond_6', 'OverallCond_7', 'OverallCond_8',
'OverallCond_9'],
dtype='object')
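As a rough illustration of how RFE arrives at a list like the one above, the sketch below runs it on synthetic data where only two of six features carry signal. The data and the choice of two features are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Six features on the same scale; only columns 0 and 3 drive the target.
X = rng.normal(size=(300, 6))
y = 5.0 * X[:, 0] - 4.0 * X[:, 3] + 0.1 * rng.normal(size=300)

# RFE repeatedly refits the estimator and drops the feature with the
# smallest coefficient magnitude until the requested number remains.
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

print(selector.support_)   # True for kept features
print(selector.ranking_)   # 1 for kept features, larger = dropped earlier
```

Because the ranking relies on coefficient size, features should be on comparable scales before RFE is run, as they are here after min-max scaling. RFE fits its own clone of the estimator, so the estimator does not need to be fitted beforehand.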
X_train_rfe = X_train[col_list]
import statsmodels.api as sm
#add a constant (intercept) column
X_train_rfe = sm.add_constant(X_train_rfe)
#fit the OLS model
lm = sm.OLS(y_train, X_train_rfe).fit()
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.863
Model: OLS Adj. R-squared: 0.857
Method: Least Squares F-statistic: 157.9
Date: Sun, 22 Oct 2023 Prob (F-statistic): 0.00
Time: 16:14:10 Log-Likelihood: 1820.0
No. Observations: 1021 AIC: -3560.
Df Residuals: 981 BIC: -3363.
Df Model: 39
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 0.1048 0.045 2.331 0.020 0.017 0.193
LotFrontage -0.0397 0.022 -1.792 0.073 -0.083 0.004
LotArea 0.1724 0.032 5.398 0.000 0.110 0.235
1stFlrSF 0.3403 0.022 15.550 0.000 0.297 0.383
2ndFlrSF 0.1479 0.008 18.826 0.000 0.133 0.163
BsmtFullBath 0.0331 0.008 3.996 0.000 0.017 0.049
GarageCars 0.0896 0.017 5.376 0.000 0.057 0.122
GarageArea -0.0457 0.021 -2.149 0.032 -0.087 -0.004
age -0.1200 0.011 -10.861 0.000 -0.142 -0.098
MSSubClass_90 -0.0350 0.007 -4.691 0.000 -0.050 -0.020
MSSubClass_160 -0.0486 0.008 -6.413 0.000 -0.063 -0.034
LotShape_IR3 -0.0467 0.015 -3.085 0.002 -0.076 -0.017
Neighborhood_Crawfor 0.0488 0.008 6.239 0.000 0.033 0.064
Neighborhood_NoRidge 0.0806 0.009 9.404 0.000 0.064 0.097
Neighborhood_NridgHt 0.0428 0.007 5.983 0.000 0.029 0.057
Neighborhood_Somerst 0.0386 0.007 5.879 0.000 0.026 0.051
Neighborhood_StoneBr 0.0349 0.013 2.770 0.006 0.010 0.060
Exterior1st_BrkFace 0.0351 0.008 4.312 0.000 0.019 0.051
Exterior1st_ImStucc -0.0748 0.045 -1.654 0.099 -0.164 0.014
Exterior2nd_ImStucc 0.0386 0.016 2.406 0.016 0.007 0.070
BsmtQual_Fa -0.0395 0.012 -3.308 0.001 -0.063 -0.016
BsmtQual_Gd -0.0386 0.006 -6.037 0.000 -0.051 -0.026
BsmtQual_NO -0.0322 0.006 -5.716 0.000 -0.043 -0.021
BsmtQual_TA -0.0413 0.008 -5.440 0.000 -0.056 -0.026
BsmtExposure_Gd 0.0309 0.005 6.099 0.000 0.021 0.041
BsmtFinType1_NO -0.0322 0.006 -5.716 0.000 -0.043 -0.021
HeatingQC_Po -0.0341 0.042 -0.815 0.416 -0.116 0.048
KitchenQual_Fa -0.0365 0.012 -3.143 0.002 -0.059 -0.014
KitchenQual_Gd -0.0271 0.007 -3.713 0.000 -0.041 -0.013
KitchenQual_TA -0.0380 0.008 -4.783 0.000 -0.054 -0.022
OverallQual_8 0.0351 0.005 6.558 0.000 0.025 0.046
OverallQual_9 0.0795 0.011 6.989 0.000 0.057 0.102
OverallQual_10 0.0931 0.015 6.270 0.000 0.064 0.122
OverallCond_2 0.0580 0.050 1.163 0.245 -0.040 0.156
OverallCond_3 0.0329 0.045 0.734 0.463 -0.055 0.121
OverallCond_4 0.0481 0.044 1.088 0.277 -0.039 0.135
OverallCond_5 0.0552 0.044 1.263 0.207 -0.031 0.141
OverallCond_6 0.0730 0.044 1.668 0.096 -0.013 0.159
OverallCond_7 0.0787 0.044 1.801 0.072 -0.007 0.164
OverallCond_8 0.0854 0.044 1.948 0.052 -0.001 0.171
OverallCond_9 0.1198 0.046 2.626 0.009 0.030 0.209
==============================================================================
Omnibus: 712.444 Durbin-Watson: 1.961
Prob(Omnibus): 0.000 Jarque-Bera (JB): 87280.483
Skew: -2.304 Prob(JB): 0.00
Kurtosis: 48.060 Cond. No. 1.27e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.73e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
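Note [2] points to a singular design matrix. One possible cause is a pair of identical dummy columns; the identical coefficients reported for `BsmtQual_NO` and `BsmtFinType1_NO` would be consistent with that, though this is an inference from the summary rather than something the notebook checks. A minimal sketch of such a check, on a toy matrix with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Toy design matrix: `no_bsmt_b` is an exact copy of `no_bsmt_a`.
X = pd.DataFrame({
    "area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "no_bsmt_a": [0, 1, 0, 0, 1],
    "no_bsmt_b": [0, 1, 0, 0, 1],
    "garage": [1, 1, 0, 1, 0],
})

# A rank below the column count means some columns are linearly dependent.
rank = np.linalg.matrix_rank(X.to_numpy(dtype=float))
print(rank, X.shape[1])

# Transposing turns columns into rows, so duplicated() flags repeated columns.
dup_mask = X.T.duplicated().to_numpy()
dup_cols = X.columns[dup_mask].tolist()
print(dup_cols)
```

Dropping one column of each duplicated pair restores full rank for exact copies; other forms of collinearity need a broader check such as variance inflation factors.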
#residual analysis of the training data
#predict the value y_train_pred
y_train_pred = lm.predict(X_train_rfe)
# plot histogram for error term
fig = plt.figure()
sns.histplot((y_train - y_train_pred), bins = 20, kde = True, stat = "density") #distplot is deprecated in recent seaborn
fig.suptitle("Error term" , fontsize = 20)
plt.xlabel("error",fontsize = 10)
Text(0.5, 0, 'error')
from sklearn.metrics import r2_score
r2_score(y_train,y_train_pred)
0.8625557437680991
### Making Prediction
col_list = X_train_rfe.columns
col_list
Index(['const', 'LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF',
'BsmtFullBath', 'GarageCars', 'GarageArea', 'age', 'MSSubClass_90',
'MSSubClass_160', 'LotShape_IR3', 'Neighborhood_Crawfor',
'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Somerst',
'Neighborhood_StoneBr', 'Exterior1st_BrkFace', 'Exterior1st_ImStucc',
'Exterior2nd_ImStucc', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_NO',
'BsmtQual_TA', 'BsmtExposure_Gd', 'BsmtFinType1_NO', 'HeatingQC_Po',
'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'OverallQual_8',
'OverallQual_9', 'OverallQual_10', 'OverallCond_2', 'OverallCond_3',
'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7',
'OverallCond_8', 'OverallCond_9'],
dtype='object')
col_list = ['LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF',
'BsmtFullBath', 'GarageCars', 'GarageArea', 'age', 'MSSubClass_90',
'MSSubClass_160', 'LotShape_IR3', 'Neighborhood_Crawfor',
'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Somerst',
'Neighborhood_StoneBr', 'Exterior1st_BrkFace', 'Exterior1st_ImStucc',
'Exterior2nd_ImStucc', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_NO',
'BsmtQual_TA', 'BsmtExposure_Gd', 'BsmtFinType1_NO', 'HeatingQC_Po',
'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'OverallQual_8',
'OverallQual_9', 'OverallQual_10', 'OverallCond_2', 'OverallCond_3',
'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7',
'OverallCond_8', 'OverallCond_9']
#apply the fitted scaler to the numeric variables of the test data
df_test[num_val_col] = scaler.transform(df_test[num_val_col])
#Divide x and y variables
y_test = df_test.pop('SalePrice')
X_test = df_test[col_list]
#add constant
X_test = sm.add_constant(X_test)
#make prediction
y_pred = lm.predict(X_test)
#model evaluation
fig = plt.figure()
plt.scatter(y_test,y_pred)
sns.regplot(x=y_test,y=y_pred,ci=None,color = 'blue')
fig.suptitle('y_test vs y_pred', fontsize = 20)
plt.xlabel('y_test',fontsize = 10)
plt.ylabel('y_pred',fontsize = 10)
plt.show()
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)
0.8592142005003942
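Because `SalePrice` was included in the scaled columns, the predictions above are in scaled units rather than currency. The sketch below shows one way to map them back using the fitted scaler's `data_min_` and `data_range_` attributes. The toy columns, values and predictions are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training data with the target as the last scaled column (made-up values).
cols = ["GrLivArea", "SalePrice"]
train = np.array([[1000.0, 100000.0], [2000.0, 300000.0], [1500.0, 200000.0]])
scaler = MinMaxScaler().fit(train)

# Position of the target among the scaled columns.
target_idx = cols.index("SalePrice")

# Hypothetical model outputs in scaled units.
scaled_pred = np.array([0.0, 0.5, 1.0])

# Invert the min-max transform for the target column only.
price_pred = scaled_pred * scaler.data_range_[target_idx] + scaler.data_min_[target_idx]
print(price_pred)
```

The same multiplication by `data_range_` converts an RMSE from scaled units into price units.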
print("Training Data")
rss = np.sum(np.square(y_train - y_train_pred))
print(rss)
mse = mean_squared_error(y_train,y_train_pred)
print(mse)
rmse = mse**0.5
print(rmse)
Training Data
1.6912554842430993
0.0016564696221773745
0.040699749657428785
print("Testing Data")
rss = np.sum(np.square(y_test - y_pred))
print(rss)
mse = mean_squared_error(y_test,y_pred)
print(mse)
rmse = mse**0.5
print(rmse)
Testing Data
0.765286435658425
0.0017472293051562216
0.0417998720710509
Linear Regression (RFE) with 40 features:
Training Data:
r2 score: 0.8625557437680991
RMSE: 0.040699749657428785
Testing Data:
r2 score: 0.8592142005003942
RMSE: 0.0417998720710509
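The RSS, MSE, RMSE and R² figures reported above are related by simple formulas. The helper below is a hypothetical convenience function (not part of the notebook) that computes all four in one place with NumPy, shown on a tiny hand-checkable example.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return RSS, MSE, RMSE and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rss = float(np.sum(residuals ** 2))                  # residual sum of squares
    tss = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    mse = rss / y_true.size                              # mean squared error
    return {"rss": rss, "mse": mse, "rmse": mse ** 0.5, "r2": 1.0 - rss / tss}

report = regression_report([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(report)
```

The training figures above follow the same relationship: an RSS of about 1.6913 divided by 1021 rows gives the reported MSE of about 0.001656.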
col_x = X_train_rfe.columns
X_train_rfe.pop('const')
X_test.pop('const')
X = X_train_rfe
col_x
Index(['const', 'LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF',
'BsmtFullBath', 'GarageCars', 'GarageArea', 'age', 'MSSubClass_90',
'MSSubClass_160', 'LotShape_IR3', 'Neighborhood_Crawfor',
'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Somerst',
'Neighborhood_StoneBr', 'Exterior1st_BrkFace', 'Exterior1st_ImStucc',
'Exterior2nd_ImStucc', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_NO',
'BsmtQual_TA', 'BsmtExposure_Gd', 'BsmtFinType1_NO', 'HeatingQC_Po',
'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'OverallQual_8',
'OverallQual_9', 'OverallQual_10', 'OverallCond_2', 'OverallCond_3',
'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7',
'OverallCond_8', 'OverallCond_9'],
dtype='object')
X_test.columns
Index(['LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF', 'BsmtFullBath',
'GarageCars', 'GarageArea', 'age', 'MSSubClass_90', 'MSSubClass_160',
'LotShape_IR3', 'Neighborhood_Crawfor', 'Neighborhood_NoRidge',
'Neighborhood_NridgHt', 'Neighborhood_Somerst', 'Neighborhood_StoneBr',
'Exterior1st_BrkFace', 'Exterior1st_ImStucc', 'Exterior2nd_ImStucc',
'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_NO', 'BsmtQual_TA',
'BsmtExposure_Gd', 'BsmtFinType1_NO', 'HeatingQC_Po', 'KitchenQual_Fa',
'KitchenQual_Gd', 'KitchenQual_TA', 'OverallQual_8', 'OverallQual_9',
'OverallQual_10', 'OverallCond_2', 'OverallCond_3', 'OverallCond_4',
'OverallCond_5', 'OverallCond_6', 'OverallCond_7', 'OverallCond_8',
'OverallCond_9'],
dtype='object')
feature_names = ['Const','LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF',
'BsmtFullBath', 'GarageCars', 'GarageArea', 'age', 'MSSubClass_90',
'MSSubClass_160', 'LotShape_IR3', 'Neighborhood_Crawfor',
'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Somerst',
'Neighborhood_StoneBr', 'Exterior1st_BrkFace', 'Exterior1st_ImStucc',
'Exterior2nd_ImStucc', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_NO',
'BsmtQual_TA', 'BsmtExposure_Gd', 'BsmtFinType1_NO', 'HeatingQC_Po',
'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'OverallQual_8',
'OverallQual_9', 'OverallQual_10', 'OverallCond_2', 'OverallCond_3',
'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7',
'OverallCond_8', 'OverallCond_9']
#Apply Ridge regression while varying the hyperparameter lambda (called alpha in scikit-learn)
#X_seq = np.linspace(X.min(),X.max(),300).reshape(-1,1) #not used below
lambdas = [0,0.001,0.1,1,10,100,1000]
for i in lambdas:
    degree = 1 #degree 1 only prepends a constant (bias) column to X
    ridgecoef = PolynomialFeatures(degree)
    X_poly = ridgecoef.fit_transform(X)
    ridgereg = Ridge(alpha = i)
    ridgereg.fit(X_poly, y_train)
    y_train_pred = ridgereg.predict(X_poly)
    print("Lamda:"+str(i))
    print("Training Data:")
    print("r2score: "+str(r2_score(y_train,y_train_pred)))
    print((mean_squared_error(y_train,y_train_pred))**0.5) #RMSE
    print(ridgereg.coef_)
    #coefs = pd.DataFrame(ridgereg.coef_, columns=['coefficient importance'],index= feature_names)
    #coefs.plot.barh(figsize=(9,7))
    #plt.title("Ridge Model with regularization,Normalized variables")
    #plt.xlabel("Raw Coefficient Values")
    #plt.axvline(x=0, color=".5")
    #plt.subplots_adjust(left = 0.3)
    print("Testing Data:")
    y_test_pred = ridgereg.predict(ridgecoef.transform(X_test))
    print("r2score: "+str(r2_score(y_test,y_test_pred)))
    print((mean_squared_error(y_test,y_test_pred))**0.5) #RMSE
Lamda:0
Training Data:
r2score: 0.8619396055155792
0.040790872533582725
[ 6.44852470e+12 -3.69143735e-02 1.76369091e-01 3.39225045e-01 1.46279261e-01 2.91588009e-02 8.80364262e-02 -4.59003217e-02 -1.20446248e-01 -3.33878022e-02 -4.91875635e-02 -4.27809520e-02 5.06139052e-02 8.24878251e-02 4.53352515e-02 4.01618481e-02 3.50785058e-02 3.44124504e-02 -5.92657651e-02 3.59000827e-02 -3.51860536e-02 -3.62059272e-02 -9.32745221e+12 -3.82330676e-02 3.20242060e-02 9.32745221e+12 -1.69832407e-02 -4.14153595e-02 -2.72682939e-02 -3.84479236e-02 3.60291254e-02 7.92398328e-02 9.54965477e-02 5.43711533e-02 3.05745975e-02 4.32014741e-02 5.13965021e-02 6.89578287e-02 7.49873500e-02 8.20393002e-02 1.15162119e-01]
Testing Data:
r2score: 0.8596462137565386
0.04173568955617854
Lamda:0.001
Training Data:
r2score: 0.8625557120840566
0.040699754348539785
[ 0. -0.03966848 0.17233748 0.34020237 0.14791761 0.03306778 0.08958112 -0.04565655 -0.11999162 -0.03497005 -0.04855095 -0.0466545 0.04878863 0.080608 0.04278389 0.03856631 0.03490497 0.03509671 -0.07473961 0.03855483 -0.03955892 -0.03857712 -0.03220022 -0.04127469 0.03092726 -0.03220022 -0.03403994 -0.03654699 -0.02706356 -0.03795458 0.03513227 0.07947635 0.0930768 0.05739029 0.03228381 0.04745174 0.05462771 0.07239927 0.07809129 0.08481384 0.1192196 ]
Testing Data:
r2score: 0.8592177853257507
0.04179933989310499
Lamda:0.1
Training Data:
r2score: 0.8624333275009848
0.04071787046752303
[ 0. -0.03567247 0.16432622 0.33196372 0.14662906 0.03290048 0.08737405 -0.04037823 -0.11864693 -0.03436661 -0.04831056 -0.04499798 0.04891015 0.08058766 0.04251978 0.0379643 0.03475098 0.03566593 -0.06661707 0.0369415 -0.04145843 -0.03865476 -0.03239821 -0.04165734 0.03101843 -0.03239821 -0.03119649 -0.03821795 -0.02727414 -0.03845574 0.03522291 0.07968144 0.09360243 0.02524068 0.00118712 0.01610647 0.02339936 0.04102592 0.0467178 0.05314446 0.08678717]
Testing Data:
r2score: 0.8592770504006628
0.04179054083499369
Lamda:1
Training Data:
r2score: 0.859762601553011
0.041111219859311626
[ 0. -0.01071631 0.11697795 0.27397927 0.13653052 0.03273258 0.07422855 -0.01004483 -0.11047883 -0.03056786 -0.04669031 -0.03488483 0.04934448 0.08129877 0.04136794 0.03431765 0.03298502 0.03895924 -0.03299017 0.02846413 -0.04549205 -0.03883131 -0.03339034 -0.04375461 0.03373876 -0.03339034 -0.01804672 -0.04347657 -0.02904422 -0.0426561 0.03613361 0.07979844 0.09517718 -0.00251804 -0.02562573 -0.01230572 -0.00459928 0.01216628 0.01731552 0.02150638 0.05301748]
Testing Data:
r2score: 0.8551411805427173
0.042400209031549174
Lamda:10
Training Data:
r2score: 0.8295922157550721
0.04531825250695942
[ 0. 0.02503816 0.03775006 0.11803305 0.09830245 0.03048779 0.06196138 0.04200687 -0.07760294 -0.02051864 -0.03886399 -0.00911499 0.04216289 0.07584232 0.04088754 0.02217908 0.02097213 0.03586492 -0.00476943 0.0132867 -0.04144571 -0.03515633 -0.03166916 -0.04571794 0.04181323 -0.03166916 -0.00432649 -0.04550773 -0.0303654 -0.05422257 0.04070925 0.06856388 0.07927737 -0.00127463 -0.0211237 -0.0149013 -0.00454551 0.00713752 0.00914809 0.00672629 0.02815025]
Testing Data:
r2score: 0.8139895651422212
0.04804676821599447
Lamda:100
Training Data:
r2score: 0.6552720272099193
0.06445649752898806
[ 0.00000000e+00 1.37336791e-02 8.31164931e-03 3.01976970e-02 3.69682988e-02 1.65199815e-02 4.06666736e-02 3.37852043e-02 -3.13456987e-02 -1.01653214e-02 -1.65421848e-02 4.27582504e-04 1.41320090e-02 3.37746306e-02 3.13755598e-02 6.17480911e-03 5.81519056e-03 1.05193282e-02 -7.72285329e-05 3.33307518e-03 -1.16605277e-02 -1.29050191e-02 -1.45119057e-02 -3.49976375e-02 3.49243810e-02 -1.45119057e-02 -6.17798059e-04 -1.56524194e-02 -4.76457296e-03 -4.55625968e-02 3.67557181e-02 3.11260117e-02 2.86086525e-02 4.38356569e-04 -6.83561152e-03 -8.17878214e-03 8.65747564e-03 2.42892975e-03 7.46478731e-04 -1.80484138e-03 6.12596536e-03]
Testing Data:
r2score: 0.6245261613043251
0.068262999631284
Lamda:1000
Training Data:
r2score: 0.2945861856824519
0.09220413975675877
[ 0.00000000e+00 2.50632832e-03 1.23685019e-03 5.18746878e-03 5.98086935e-03 3.52307897e-03 9.60489579e-03 7.86894738e-03 -8.54025798e-03 -2.12959555e-03 -2.61633602e-03 2.43599658e-04 1.78081184e-03 6.16459865e-03 7.81674630e-03 2.03541946e-03 1.21551077e-03 1.17775311e-03 6.44883113e-05 5.04051308e-04 -1.62013738e-03 4.36402276e-03 -2.45353175e-03 -1.46897905e-02 8.55707444e-03 -2.45353175e-03 -9.44000518e-05 -2.24398304e-03 7.88826523e-03 -1.77744257e-02 1.12039534e-02 5.68406367e-03 4.55592895e-03 6.87822758e-05 -1.23107072e-03 -2.14750341e-03 8.58419385e-03 -2.26782043e-03 -2.37630117e-03 -1.14305770e-03 6.76171683e-04]
Testing Data:
r2score: 0.2761089485623204
0.09478336672986282
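Ridge Regression Model: chosen lambda = 0.001. Train and test R² are nearly flat for lambda up to 0.1 and fall steadily from lambda = 1 onwards; lambda = 0 is avoided because it gives unstable coefficients of the order of 1e12.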
Lamda:0.001
Training Data:
r2score: 0.8625557120840566
RMSE:0.040699754348539785
Testing Data:
r2score: 0.8592177853257507
RMSE:0.04179933989310499
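The lambda above is picked by comparing scores on the test split. A common alternative is cross-validation on the training data alone, for example with `GridSearchCV`, which the notebook imports. The sketch below is one possible setup on synthetic data; the grid, the 5 folds and the scoring metric are assumptions, not the notebook's method.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)

# Synthetic regression problem with a few informative features.
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + 0.5 * rng.normal(size=200)

# Candidate regularisation strengths (alpha plays the role of lambda).
param_grid = {"alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# 5-fold cross-validation; scikit-learn maximises the score, so MSE is negated.
search = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

With this approach the test split is touched only once, for the final evaluation of the chosen model.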
#Apply Lasso regression while varying the hyperparameter lambda (called alpha in scikit-learn)
#X_seq = np.linspace(X.min(),X.max(),300).reshape(-1,1)
lambdas = [0,0.001,0.1,1,10,100,1000]
for i in lambdas:
    degree = 1 #degree 1 only prepends a constant (bias) column to X
    lassocoef = PolynomialFeatures(degree)
    X_poly = lassocoef.fit_transform(X)
    lassoreg = Lasso(alpha = i)
    lassoreg.fit(X_poly, y_train)
    y_train_pred = lassoreg.predict(X_poly)
    print("Lamda:"+str(i))
    print("Training Data:")
    print("r2score: "+str(r2_score(y_train,y_train_pred)))
    print((mean_squared_error(y_train,y_train_pred))**0.5) #RMSE
    print(lassoreg.coef_)
    #feature_names = lassoreg[:-1].get_feature_names_out()
    coefs = pd.DataFrame(lassoreg.coef_, columns=['coefficient importance'],index= feature_names)
    coefs.plot.barh(figsize=(9,7))
    plt.title("Lasso Model with regularization,Normalized variables")
    plt.xlabel("Raw Coefficient Values")
    plt.axvline(x=0, color=".5")
    plt.subplots_adjust(left = 0.3)
    print("Testing Data:")
    y_test_pred = lassoreg.predict(lassocoef.transform(X_test))
    print("r2score: "+str(r2_score(y_test,y_test_pred)))
    print((mean_squared_error(y_test,y_test_pred))**0.5) #RMSE
Lamda:0
Training Data:
r2score: 0.8625543171707977
0.040699960877613454
[ 0. -0.03972659 0.17246816 0.34028563 0.14793653 0.03304235 0.08963967 -0.04567394 -0.11999319 -0.03496354 -0.04855494 -0.0466528 0.04878871 0.08059832 0.04278157 0.03856355 0.03490053 0.03508716 -0.074852 0.03858957 -0.03972599 -0.03858249 -0.04869784 -0.04128054 0.03087451 -0.01570742 -0.03408715 -0.03666121 -0.02705376 -0.03793425 0.03513468 0.07948404 0.09309599 0.05362754 0.02850073 0.04364186 0.05081577 0.06859487 0.07429948 0.08103838 0.11541116]
Testing Data:
r2score: 0.8592347052680516
0.041796827986071323
Lamda:0.001
Training Data:
r2score: 0.8051467736454287
0.048459869554225314
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 2.53522277e-01 1.13365650e-01 1.72547389e-02 8.36301313e-02 0.00000000e+00 -7.88405959e-02 -1.80007325e-02 -1.65886906e-02 -0.00000000e+00 2.38126133e-02 5.37312973e-02 3.37254495e-02 3.65546582e-05 0.00000000e+00 1.06937130e-03 -0.00000000e+00 0.00000000e+00 -0.00000000e+00 -5.29823070e-03 -1.24152233e-02 -1.00320301e-02 3.92180539e-02 -1.38784884e-03 -0.00000000e+00 -2.36613911e-03 -8.23402377e-04 -2.83275466e-02 4.71829180e-02 9.49843730e-02 9.73908048e-02 0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -7.93499508e-04 0.00000000e+00 8.05281640e-04 0.00000000e+00 0.00000000e+00]
Testing Data:
r2score: 0.7958201652667687
0.05033869480512303
Lamda:0.1
Training Data:
r2score: 0.0
0.10978131792054956
[ 0. 0. 0. 0. 0. 0. 0. 0. -0. -0. -0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -0. 0. -0. -0. 0. -0. -0. -0. 0. -0. 0. 0. 0. 0. -0. -0. 0. -0. -0. -0. 0.]
Testing Data:
r2score: -0.00030262183303753076
0.11141950647109451
Lamda:1
Training Data:
r2score: 0.0
0.10978131792054956
[ 0. 0. 0. 0. 0. 0. 0. 0. -0. -0. -0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -0. 0. -0. -0. 0. -0. -0. -0. 0. -0. 0. 0. 0. 0. -0. -0. 0. -0. -0. -0. 0.]
Testing Data:
r2score: -0.00030262183303753076
0.11141950647109451
Lamda:10
Training Data:
r2score: 0.0
0.10978131792054956
[ 0. 0. 0. 0. 0. 0. 0. 0. -0. -0. -0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -0. 0. -0. -0. 0. -0. -0. -0. 0. -0. 0. 0. 0. 0. -0. -0. 0. -0. -0. -0. 0.]
Testing Data:
r2score: -0.00030262183303753076
0.11141950647109451
Lamda:100
Training Data:
r2score: 0.0
0.10978131792054956
[ 0. 0. 0. 0. 0. 0. 0. 0. -0. -0. -0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -0. 0. -0. -0. 0. -0. -0. -0. 0. -0. 0. 0. 0. 0. -0. -0. 0. -0. -0. -0. 0.]
Testing Data:
r2score: -0.00030262183303753076
0.11141950647109451
Lamda:1000
Training Data:
r2score: 0.0
0.10978131792054956
[ 0. 0. 0. 0. 0. 0. 0. 0. -0. -0. -0. 0. 0. 0. 0. 0. 0. 0. 0. 0. -0. 0. -0. -0. 0. -0. -0. -0. 0. -0. 0. 0. 0. 0. -0. -0. 0. -0. -0. -0. 0.]
Testing Data:
r2score: -0.00030262183303753076
0.11141950647109451
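Lasso Regression Model: chosen lambda = 0.001. At this value several coefficients are shrunk exactly to zero, whereas every larger lambda tried (0.1 and above) zeroes all coefficients and gives an R² of 0.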
Lamda:0.001
Training Data:
r2score: 0.8051467736454287
RMSE :0.048459869554225314
Testing Data:
r2score: 0.7958201652667687
RMSE :0.05033869480512303
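In the loop output, every lambda from 0.1 upwards drives all Lasso coefficients to zero. With scikit-learn's Lasso objective, (1/(2n))·||y − Xw||² + alpha·||w||₁, this happens once alpha exceeds max|Xᵀy|/n computed on centred data, and that threshold is small when features and target are scaled to [0, 1]. The sketch below illustrates this on synthetic data; the data and coefficients are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)

# Features in [0, 1], similar in scale to min-max scaled data (synthetic).
X = rng.uniform(size=(150, 5))
y = 0.3 * X[:, 0] - 0.1 * X[:, 2] + 0.02 * rng.normal(size=150)

# Smallest alpha at which every coefficient is exactly zero.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / len(y)
print(alpha_max)

# Just above the threshold: nothing survives. Well below it: some features do.
all_zero = Lasso(alpha=alpha_max * 1.01).fit(X, y)
some_kept = Lasso(alpha=alpha_max * 0.5).fit(X, y)
print(all_zero.coef_)
print(some_kept.coef_)
```

A finer grid below this threshold (for instance several values between 0.0001 and 0.01) would give a more informative search than the powers of ten used above.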
degree = 1 #creating degree 1
#lambda
i = 0.001
#Ridge
print("RIDGE")
ridgecoef = PolynomialFeatures(degree)
X_poly = ridgecoef.fit_transform(X)
ridgereg = Ridge(alpha = i)
ridgereg.fit(X_poly, y_train)
y_train_pred = ridgereg.predict(ridgecoef.fit_transform(X))
print("Lamda:"+str(i))
print("Training Data:")
print("r2score: "+str(r2_score(y_train,y_train_pred)))
print((mean_squared_error(y_train,y_train_pred))**0.5)
print(ridgereg.coef_)
print("Testing Data:")
y_test_pred = ridgereg.predict(ridgecoef.transform(X_test))
print("r2score: "+str(r2_score(y_test,y_test_pred)))
print((mean_squared_error(y_test,y_test_pred))**0.5)
#feature_names = lassoreg[:-1].get_feature_names_out()
coefs = pd.DataFrame(ridgereg.coef_, columns=['coefficient importance'],index= feature_names)
coefs.plot.barh(figsize=(9,7))
plt.title("Ridge Model with regularization,Normalized variables")
plt.xlabel("Raw Coefficient Values")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left = 0.3)
RIDGE
Lamda:0.001
Training Data:
r2score: 0.8625557120840566
0.040699754348539785
[ 0. -0.03966848 0.17233748 0.34020237 0.14791761 0.03306778 0.08958112 -0.04565655 -0.11999162 -0.03497005 -0.04855095 -0.0466545 0.04878863 0.080608 0.04278389 0.03856631 0.03490497 0.03509671 -0.07473961 0.03855483 -0.03955892 -0.03857712 -0.03220022 -0.04127469 0.03092726 -0.03220022 -0.03403994 -0.03654699 -0.02706356 -0.03795458 0.03513227 0.07947635 0.0930768 0.05739029 0.03228381 0.04745174 0.05462771 0.07239927 0.07809129 0.08481384 0.1192196 ]
Testing Data:
r2score: 0.8592177853257507
0.04179933989310499
Top 5 predictor variables (ranked by absolute coefficient value):
1stFlrSF: 0.34020237
LotArea: 0.17233748
2ndFlrSF: 0.14791761
age: -0.11999162
OverallCond_9: 0.1192196
rigde_top_col=['1stFlrSF','LotArea','2ndFlrSF','age','OverallCond_9']
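The ranking above can be reproduced by sorting the coefficients on their absolute value, so that strong negative effects such as age are not overlooked. The sketch below uses a subset of the Ridge coefficients printed above for lambda = 0.001; the names ridge_coefs and top5 are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# A subset of the Ridge coefficients reported above (lambda = 0.001).
ridge_coefs = pd.Series({
    "LotFrontage": -0.03966848,
    "LotArea": 0.17233748,
    "1stFlrSF": 0.34020237,
    "2ndFlrSF": 0.14791761,
    "GarageCars": 0.08958112,
    "age": -0.11999162,
    "OverallQual_10": 0.0930768,
    "OverallCond_9": 0.1192196,
})

# Rank by magnitude, but keep the signed values for reporting.
order = ridge_coefs.abs().sort_values(ascending=False).index
top5 = ridge_coefs.reindex(order).head(5)
print(top5)
```

Because the features were scaled to a common range before fitting, comparing raw coefficient magnitudes in this way is a reasonable first-pass measure of importance.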
# Lasso
print("LASSO")
lassocoef = PolynomialFeatures(degree)
X_poly = lassocoef.fit_transform(X)  # fit the transformer on the training features only
lassoreg = Lasso(alpha = i)
lassoreg.fit(X_poly, y_train)
y_train_pred = lassoreg.predict(X_poly)  # reuse the transformed training data instead of refitting
print("Lamda:"+str(i))
print("Training Data:")
print("r2score: "+str(r2_score(y_train,y_train_pred)))
print((mean_squared_error(y_train,y_train_pred))**0.5)  # RMSE
print(lassoreg.coef_)
print("Testing Data:")
y_test_pred = lassoreg.predict(lassocoef.transform(X_test))  # transform only, never refit on test data
print("r2score: "+str(r2_score(y_test,y_test_pred)))
print((mean_squared_error(y_test,y_test_pred))**0.5)  # RMSE
# Horizontal bar chart of the fitted coefficients
coefs = pd.DataFrame(lassoreg.coef_, columns=['coefficient importance'],index= feature_names)
coefs.plot.barh(figsize=(9,7))
plt.title("Lasso Model with regularization, Normalized variables")
plt.xlabel("Raw Coefficient Values")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left = 0.3)
LASSO
Lamda:0.001
Training Data:
r2score: 0.8051467736454287
0.048459869554225314
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 2.53522277e-01 1.13365650e-01 1.72547389e-02 8.36301313e-02 0.00000000e+00 -7.88405959e-02 -1.80007325e-02 -1.65886906e-02 -0.00000000e+00 2.38126133e-02 5.37312973e-02 3.37254495e-02 3.65546582e-05 0.00000000e+00 1.06937130e-03 -0.00000000e+00 0.00000000e+00 -0.00000000e+00 -5.29823070e-03 -1.24152233e-02 -1.00320301e-02 3.92180539e-02 -1.38784884e-03 -0.00000000e+00 -2.36613911e-03 -8.23402377e-04 -2.83275466e-02 4.71829180e-02 9.49843730e-02 9.73908048e-02 0.00000000e+00 -0.00000000e+00 -0.00000000e+00 -7.93499508e-04 0.00000000e+00 8.05281640e-04 0.00000000e+00 0.00000000e+00]
Testing Data:
r2score: 0.7958201652667687
0.05033869480512303
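Top 5 predictor variables:
1stFlrSF: 0.253522
2ndFlrSF: 0.113366
OverallQual_10: 0.097391
OverallQual_9: 0.094984
age: -0.078841
Note that GarageCars (0.083630) is slightly larger in magnitude than age; age is the variable carried into lasso_top_col below.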
lasso_top_col=['1stFlrSF','2ndFlrSF','age','OverallQual_10','OverallQual_9']
betas = pd.DataFrame(index=feature_names , columns=['Ridge','Lasso'])
betas['Ridge'] = ridgereg.coef_
betas['Lasso'] = lassoreg.coef_
betas
| | Ridge | Lasso |
|---|---|---|
| Const | 0.000000 | 0.000000 |
| LotFrontage | -0.039668 | 0.000000 |
| LotArea | 0.172337 | 0.000000 |
| 1stFlrSF | 0.340202 | 0.253522 |
| 2ndFlrSF | 0.147918 | 0.113366 |
| BsmtFullBath | 0.033068 | 0.017255 |
| GarageCars | 0.089581 | 0.083630 |
| GarageArea | -0.045657 | 0.000000 |
| age | -0.119992 | -0.078841 |
| MSSubClass_90 | -0.034970 | -0.018001 |
| MSSubClass_160 | -0.048551 | -0.016589 |
| LotShape_IR3 | -0.046654 | -0.000000 |
| Neighborhood_Crawfor | 0.048789 | 0.023813 |
| Neighborhood_NoRidge | 0.080608 | 0.053731 |
| Neighborhood_NridgHt | 0.042784 | 0.033725 |
| Neighborhood_Somerst | 0.038566 | 0.000037 |
| Neighborhood_StoneBr | 0.034905 | 0.000000 |
| Exterior1st_BrkFace | 0.035097 | 0.001069 |
| Exterior1st_ImStucc | -0.074740 | -0.000000 |
| Exterior2nd_ImStucc | 0.038555 | 0.000000 |
| BsmtQual_Fa | -0.039559 | -0.000000 |
| BsmtQual_Gd | -0.038577 | -0.005298 |
| BsmtQual_NO | -0.032200 | -0.012415 |
| BsmtQual_TA | -0.041275 | -0.010032 |
| BsmtExposure_Gd | 0.030927 | 0.039218 |
| BsmtFinType1_NO | -0.032200 | -0.001388 |
| HeatingQC_Po | -0.034040 | -0.000000 |
| KitchenQual_Fa | -0.036547 | -0.002366 |
| KitchenQual_Gd | -0.027064 | -0.000823 |
| KitchenQual_TA | -0.037955 | -0.028328 |
| OverallQual_8 | 0.035132 | 0.047183 |
| OverallQual_9 | 0.079476 | 0.094984 |
| OverallQual_10 | 0.093077 | 0.097391 |
| OverallCond_2 | 0.057390 | 0.000000 |
| OverallCond_3 | 0.032284 | -0.000000 |
| OverallCond_4 | 0.047452 | -0.000000 |
| OverallCond_5 | 0.054628 | -0.000793 |
| OverallCond_6 | 0.072399 | 0.000000 |
| OverallCond_7 | 0.078091 | 0.000805 |
| OverallCond_8 | 0.084814 | 0.000000 |
| OverallCond_9 | 0.119220 | 0.000000 |
degree = 1  # degree-1 polynomial features: the original columns plus a constant column
# lambda doubled from 0.001 to 0.002 to see how sensitive each model is
i = 0.002
# Ridge
print("RIDGE")
ridgecoef = PolynomialFeatures(degree)
X_poly = ridgecoef.fit_transform(X)  # fit the transformer on the training features only
ridgereg = Ridge(alpha = i)
ridgereg.fit(X_poly, y_train)
y_train_pred = ridgereg.predict(X_poly)  # reuse the transformed training data instead of refitting
print("Lamda:"+str(i))
print("Training Data:")
print("r2score: "+str(r2_score(y_train,y_train_pred)))
print((mean_squared_error(y_train,y_train_pred))**0.5)  # RMSE
print(ridgereg.coef_)
print("Testing Data:")
y_test_pred = ridgereg.predict(ridgecoef.transform(X_test))  # transform only, never refit on test data
print("r2score: "+str(r2_score(y_test,y_test_pred)))
print((mean_squared_error(y_test,y_test_pred))**0.5)  # RMSE
# Horizontal bar chart of the fitted coefficients
coefs = pd.DataFrame(ridgereg.coef_, columns=['coefficient importance'],index= feature_names)
coefs.plot.barh(figsize=(9,7))
plt.title("Ridge Model with regularization, Normalized variables")
plt.xlabel("Raw Coefficient Values")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left = 0.3)
RIDGE
Lamda:0.002
Training Data:
r2score: 0.8625556188986986
0.04069976814547741
[ 0. -0.03962698 0.17225414 0.34011656 0.14790465 0.03306448 0.08955932 -0.04559761 -0.11997523 -0.03496315 -0.0485484 -0.04663587 0.04879004 0.0806069 0.04278049 0.03855931 0.03490373 0.03510248 -0.07464904 0.03853792 -0.03959073 -0.0385783 -0.0322026 -0.04127926 0.03092496 -0.0322026 -0.03400937 -0.03657283 -0.02706538 -0.0379591 0.03513324 0.07947953 0.09308406 0.05678541 0.03169856 0.04686344 0.0540413 0.07181111 0.07750371 0.08422346 0.11861651]
Testing Data:
r2score: 0.8592212655878217
0.04179882323140653
# Lasso
print("LASSO")
lassocoef = PolynomialFeatures(degree)
X_poly = lassocoef.fit_transform(X)  # fit the transformer on the training features only
lassoreg = Lasso(alpha = i)
lassoreg.fit(X_poly, y_train)
y_train_pred = lassoreg.predict(X_poly)  # reuse the transformed training data instead of refitting
print("Lamda:"+str(i))
print("Training Data:")
print("r2score: "+str(r2_score(y_train,y_train_pred)))
print((mean_squared_error(y_train,y_train_pred))**0.5)  # RMSE
print(lassoreg.coef_)
print("Testing Data:")
y_test_pred = lassoreg.predict(lassocoef.transform(X_test))  # transform only, never refit on test data
print("r2score: "+str(r2_score(y_test,y_test_pred)))
print((mean_squared_error(y_test,y_test_pred))**0.5)  # RMSE
# Horizontal bar chart of the fitted coefficients
coefs = pd.DataFrame(lassoreg.coef_, columns=['coefficient importance'],index= feature_names)
coefs.plot.barh(figsize=(9,7))
plt.title("Lasso Model with regularization, Normalized variables")
plt.xlabel("Raw Coefficient Values")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left = 0.3)
LASSO
Lamda:0.002
Training Data:
r2score: 0.7293552227618919
0.05711211790440109
[ 0. 0. 0. 0.14941177 0.07693467 0. 0.09792412 0. -0.0485874 -0. -0. -0. 0. 0.0447109 0.03322644 0. 0. 0. -0. 0. -0. -0. -0. -0.01117937 0.04349696 -0. -0. -0. -0. -0.03837864 0.04685108 0.06982025 0.06010389 0. -0. -0. 0. 0. 0. 0. 0. ]
Testing Data:
r2score: 0.7138636351836265
0.0595911895166149
betas = pd.DataFrame(index=feature_names , columns=['Ridge','Lasso'])
betas['Ridge'] = ridgereg.coef_
betas['Lasso'] = lassoreg.coef_
betas
| | Ridge | Lasso |
|---|---|---|
| Const | 0.000000 | 0.000000 |
| LotFrontage | -0.039627 | 0.000000 |
| LotArea | 0.172254 | 0.000000 |
| 1stFlrSF | 0.340117 | 0.149412 |
| 2ndFlrSF | 0.147905 | 0.076935 |
| BsmtFullBath | 0.033064 | 0.000000 |
| GarageCars | 0.089559 | 0.097924 |
| GarageArea | -0.045598 | 0.000000 |
| age | -0.119975 | -0.048587 |
| MSSubClass_90 | -0.034963 | -0.000000 |
| MSSubClass_160 | -0.048548 | -0.000000 |
| LotShape_IR3 | -0.046636 | -0.000000 |
| Neighborhood_Crawfor | 0.048790 | 0.000000 |
| Neighborhood_NoRidge | 0.080607 | 0.044711 |
| Neighborhood_NridgHt | 0.042780 | 0.033226 |
| Neighborhood_Somerst | 0.038559 | 0.000000 |
| Neighborhood_StoneBr | 0.034904 | 0.000000 |
| Exterior1st_BrkFace | 0.035102 | 0.000000 |
| Exterior1st_ImStucc | -0.074649 | -0.000000 |
| Exterior2nd_ImStucc | 0.038538 | 0.000000 |
| BsmtQual_Fa | -0.039591 | -0.000000 |
| BsmtQual_Gd | -0.038578 | -0.000000 |
| BsmtQual_NO | -0.032203 | -0.000000 |
| BsmtQual_TA | -0.041279 | -0.011179 |
| BsmtExposure_Gd | 0.030925 | 0.043497 |
| BsmtFinType1_NO | -0.032203 | -0.000000 |
| HeatingQC_Po | -0.034009 | -0.000000 |
| KitchenQual_Fa | -0.036573 | -0.000000 |
| KitchenQual_Gd | -0.027065 | -0.000000 |
| KitchenQual_TA | -0.037959 | -0.038379 |
| OverallQual_8 | 0.035133 | 0.046851 |
| OverallQual_9 | 0.079480 | 0.069820 |
| OverallQual_10 | 0.093084 | 0.060104 |
| OverallCond_2 | 0.056785 | 0.000000 |
| OverallCond_3 | 0.031699 | -0.000000 |
| OverallCond_4 | 0.046863 | -0.000000 |
| OverallCond_5 | 0.054041 | 0.000000 |
| OverallCond_6 | 0.071811 | 0.000000 |
| OverallCond_7 | 0.077504 | 0.000000 |
| OverallCond_8 | 0.084223 | 0.000000 |
| OverallCond_9 | 0.118617 | 0.000000 |
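The table shows the typical contrast between the two penalties: doubling lambda barely moves the Ridge coefficients, while Lasso drives more of them to exactly zero. A small sketch of that behaviour follows. It assumes synthetic data from make_regression rather than the housing data, the alpha values are chosen for the synthetic scale and are not the notebook's, and count_nonzero is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 20 features, only 5 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=42
)


def count_nonzero(model):
    """Hypothetical helper: fit the model and count coefficients that are not exactly zero."""
    model.fit(X_demo, y_demo)
    return int(np.sum(model.coef_ != 0))


ridge_nonzero = count_nonzero(Ridge(alpha=1.0))
lasso_low_nonzero = count_nonzero(Lasso(alpha=0.01, max_iter=10000))
lasso_high_nonzero = count_nonzero(Lasso(alpha=20.0, max_iter=10000))

print("Ridge non-zero coefficients:", ridge_nonzero)
print("Lasso (small alpha) non-zero coefficients:", lasso_low_nonzero)
print("Lasso (large alpha) non-zero coefficients:", lasso_high_nonzero)
```

Ridge only shrinks coefficients towards zero, whereas the L1 penalty in Lasso can remove features entirely, which is why Lasso doubles as a feature-selection step.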
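# Refit both models after removing their five most important predictors
X = X_train_rfe
degree = 1  # degree-1 polynomial features: the original columns plus a constant column
# lambda
i = 0.001
# Drop the Ridge top-5 predictors from both the training and the test features
X_ridge = X.drop(columns = rigde_top_col)
X_ridge_test = X_test.drop(columns = rigde_top_col)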
X_ridge.shape
(1021, 35)
X_ridge.columns
Index(['LotFrontage', 'BsmtFullBath', 'GarageCars', 'GarageArea',
'MSSubClass_90', 'MSSubClass_160', 'LotShape_IR3',
'Neighborhood_Crawfor', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Exterior1st_BrkFace',
'Exterior1st_ImStucc', 'Exterior2nd_ImStucc', 'BsmtQual_Fa',
'BsmtQual_Gd', 'BsmtQual_NO', 'BsmtQual_TA', 'BsmtExposure_Gd',
'BsmtFinType1_NO', 'HeatingQC_Po', 'KitchenQual_Fa', 'KitchenQual_Gd',
'KitchenQual_TA', 'OverallQual_8', 'OverallQual_9', 'OverallQual_10',
'OverallCond_2', 'OverallCond_3', 'OverallCond_4', 'OverallCond_5',
'OverallCond_6', 'OverallCond_7', 'OverallCond_8'],
dtype='object')
# Coefficient labels: the constant column added by PolynomialFeatures, then the remaining features
feature_names_ridge = ['const'] + list(X_ridge.columns)
# Ridge
print("RIDGE")
ridgecoef = PolynomialFeatures(degree)
X_poly = ridgecoef.fit_transform(X_ridge)  # fit the transformer on the training features only
ridgereg = Ridge(alpha = i)
ridgereg.fit(X_poly, y_train)
y_train_pred = ridgereg.predict(X_poly)  # reuse the transformed training data instead of refitting
print("Lamda:"+str(i))
print("Training Data:")
print("r2score: "+str(r2_score(y_train,y_train_pred)))
print((mean_squared_error(y_train,y_train_pred))**0.5)  # RMSE
print(ridgereg.coef_)
print("Testing Data:")
y_test_pred = ridgereg.predict(ridgecoef.transform(X_ridge_test))  # transform only, never refit on test data
print("r2score: "+str(r2_score(y_test,y_test_pred)))
print((mean_squared_error(y_test,y_test_pred))**0.5)  # RMSE
# Horizontal bar chart of the fitted coefficients
coefs = pd.DataFrame(ridgereg.coef_, columns=['coefficient importance'],index= feature_names_ridge)
coefs.plot.barh(figsize=(9,7))
plt.title("Ridge Model with regularization, Normalized variables")
plt.xlabel("Raw Coefficient Values")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left = 0.3)
RIDGE
Lamda:0.001
Training Data:
r2score: 0.7834556415116392
0.051085997413457165
[ 0. 0.10931535 0.0233091 0.10752951 0.02686144 -0.01137716 -0.02587471 -0.01330696 0.05519398 0.11899084 0.0307999 0.0257183 0.02778048 0.04577266 -0.13923112 0.03508675 -0.09586822 -0.04420711 -0.05051227 -0.07487915 0.03175578 -0.05051227 -0.07534518 -0.07698398 -0.03718107 -0.06573777 0.04921761 0.1029065 0.16120272 -0.07531687 -0.0788448 -0.0637915 -0.04554613 -0.03322095 -0.03449167 -0.03751543]
Testing Data:
r2score: 0.7619842188882349
0.05434985670015881
X_lasso = X.drop(columns = lasso_top_col)
X_lasso_test = X_test.drop(columns = lasso_top_col)
X_lasso.columns
Index(['LotFrontage', 'LotArea', 'BsmtFullBath', 'GarageCars', 'GarageArea',
'MSSubClass_90', 'MSSubClass_160', 'LotShape_IR3',
'Neighborhood_Crawfor', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
'Neighborhood_Somerst', 'Neighborhood_StoneBr', 'Exterior1st_BrkFace',
'Exterior1st_ImStucc', 'Exterior2nd_ImStucc', 'BsmtQual_Fa',
'BsmtQual_Gd', 'BsmtQual_NO', 'BsmtQual_TA', 'BsmtExposure_Gd',
'BsmtFinType1_NO', 'HeatingQC_Po', 'KitchenQual_Fa', 'KitchenQual_Gd',
'KitchenQual_TA', 'OverallQual_8', 'OverallCond_2', 'OverallCond_3',
'OverallCond_4', 'OverallCond_5', 'OverallCond_6', 'OverallCond_7',
'OverallCond_8', 'OverallCond_9'],
dtype='object')
# Coefficient labels: the constant column added by PolynomialFeatures, then the remaining features
feature_names_lasso = ['const'] + list(X_lasso.columns)
X_lasso.shape
(1021, 35)
# Lasso
print("LASSO")
lassocoef = PolynomialFeatures(degree)
X_poly = lassocoef.fit_transform(X_lasso)  # fit the transformer on the training features only
lassoreg = Lasso(alpha = i)
lassoreg.fit(X_poly, y_train)
y_train_pred = lassoreg.predict(X_poly)  # reuse the transformed training data instead of refitting
print("Lamda:"+str(i))
print("Training Data:")
print("r2score: "+str(r2_score(y_train,y_train_pred)))
print((mean_squared_error(y_train,y_train_pred))**0.5)  # RMSE
print(lassoreg.coef_)
print("Testing Data:")
y_test_pred = lassoreg.predict(lassocoef.transform(X_lasso_test))  # transform only, never refit on test data
print("r2score: "+str(r2_score(y_test,y_test_pred)))
print((mean_squared_error(y_test,y_test_pred))**0.5)  # RMSE
# Horizontal bar chart of the fitted coefficients
coefs = pd.DataFrame(lassoreg.coef_, columns=['coefficient importance'],index= feature_names_lasso )
coefs.plot.barh(figsize=(9,7))
plt.title("Lasso Model with regularization, Normalized variables")
plt.xlabel("Raw Coefficient Values")
plt.axvline(x=0, color=".5")
plt.subplots_adjust(left = 0.3)
LASSO
Lamda:0.001
Training Data:
r2score: 0.7047142382911329
0.05965539055961897
[ 0. 0. 0. 0.0024356 0.11020291 0.05692849 -0. -0.01258588 0. 0.02792744 0.10989839 0.04419875 0. 0. 0.00299286 -0. 0. -0.03937701 -0.03040612 -0.04516775 -0.05340777 0.05191409 -0.01408318 -0. -0.07265162 -0.0528089 -0.09570708 0.03388125 -0. -0. -0. 0. 0. 0. -0. 0. ]
Testing Data:
r2score: 0.668442642034726
0.06414678017773935
Top 5 features of the Lasso model after removing the original top 5 predictors (lambda = 0.001):
GarageCars: 0.11020291
Neighborhood_NoRidge: 0.10989839
KitchenQual_TA: -0.09570708
KitchenQual_Fa: -0.07265162
GarageArea: 0.05692849
betas = pd.DataFrame(index=feature_names_lasso , columns=['Lasso'])
betas['Lasso'] = lassoreg.coef_
betas
| | Lasso |
|---|---|
| const | 0.000000 |
| LotFrontage | 0.000000 |
| LotArea | 0.000000 |
| BsmtFullBath | 0.002436 |
| GarageCars | 0.110203 |
| GarageArea | 0.056928 |
| MSSubClass_90 | -0.000000 |
| MSSubClass_160 | -0.012586 |
| LotShape_IR3 | 0.000000 |
| Neighborhood_Crawfor | 0.027927 |
| Neighborhood_NoRidge | 0.109898 |
| Neighborhood_NridgHt | 0.044199 |
| Neighborhood_Somerst | 0.000000 |
| Neighborhood_StoneBr | 0.000000 |
| Exterior1st_BrkFace | 0.002993 |
| Exterior1st_ImStucc | -0.000000 |
| Exterior2nd_ImStucc | 0.000000 |
| BsmtQual_Fa | -0.039377 |
| BsmtQual_Gd | -0.030406 |
| BsmtQual_NO | -0.045168 |
| BsmtQual_TA | -0.053408 |
| BsmtExposure_Gd | 0.051914 |
| BsmtFinType1_NO | -0.014083 |
| HeatingQC_Po | -0.000000 |
| KitchenQual_Fa | -0.072652 |
| KitchenQual_Gd | -0.052809 |
| KitchenQual_TA | -0.095707 |
| OverallQual_8 | 0.033881 |
| OverallCond_2 | -0.000000 |
| OverallCond_3 | -0.000000 |
| OverallCond_4 | -0.000000 |
| OverallCond_5 | 0.000000 |
| OverallCond_6 | 0.000000 |
| OverallCond_7 | 0.000000 |
| OverallCond_8 | -0.000000 |
| OverallCond_9 | 0.000000 |
Ridge Regression Model:
Lambda / alpha: 0.001
Training Data:
r2score: 0.8625557120840566
RMSE: 0.040699754348539785
Testing Data:
r2score: 0.8592177853257507
RMSE: 0.04179933989310499
Top 5 predictor variables (independent variables):
1stFlrSF: 0.34020237
LotArea: 0.17233748
2ndFlrSF: 0.14791761
age: -0.11999162
OverallCond_9: 0.1192196
Lasso Regression Model:
Lambda / alpha: 0.001
Training Data:
r2score: 0.8051467736454287
RMSE: 0.048459869554225314
Testing Data:
r2score: 0.7958201652667687
RMSE: 0.05033869480512303
Top 5 predictor variables (independent variables):
1stFlrSF: 0.253522
2ndFlrSF: 0.113366
OverallQual_10: 0.097391
OverallQual_9: 0.094984
age: -0.078841
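Note: The target variable is normalized, so the RMSE values above are on the normalized scale. Predictions can be converted back to actual prices by inverting the scaling.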
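As a sketch of that inversion, the example below assumes the target was scaled on its own with MinMaxScaler (the notebook imports this class, but how the target was actually scaled is not shown in this section). The prices and the scaled predictions are hypothetical values chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sale prices used only to fit the scaler (assumption).
prices = np.array([[100000.0], [150000.0], [200000.0], [300000.0]])
scaler = MinMaxScaler()
scaled_prices = scaler.fit_transform(prices)

# Hypothetical model predictions on the normalized scale.
pred_scaled = np.array([[0.25], [0.5]])
pred_prices = scaler.inverse_transform(pred_scaled)
print(pred_prices.ravel())

# An RMSE on the normalized scale converts to price units via the price range.
rmse_scaled = 0.0418
price_range = scaler.data_max_[0] - scaler.data_min_[0]
rmse_price = rmse_scaled * price_range
print(rmse_price)
```

If the target was scaled together with other columns in a single scaler, the inversion would instead need the minimum and range stored for the target column only.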